WEBVTT
00:00:00.080 --> 00:00:01.770
The following
content is provided
00:00:01.770 --> 00:00:04.010
under a Creative
Commons license.
00:00:04.010 --> 00:00:06.860
Your support will help MIT
OpenCourseWare continue
00:00:06.860 --> 00:00:10.720
to offer high quality
educational resources for free.
00:00:10.720 --> 00:00:13.330
To make a donation or
view additional materials
00:00:13.330 --> 00:00:17.228
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:17.228 --> 00:00:17.853
at ocw.mit.edu.
00:00:23.290 --> 00:00:27.010
PROFESSOR: So today's
lecture is on sorting.
00:00:27.010 --> 00:00:31.310
We'll be talking about specific
sorting algorithms today.
00:00:31.310 --> 00:00:34.570
I want to start
by motivating why
00:00:34.570 --> 00:00:37.105
we're interested in sorting,
which should be fairly easy.
00:00:39.690 --> 00:00:42.490
Then I want to discuss
a particular sorting
00:00:42.490 --> 00:00:45.470
algorithm that's
called insertion sort.
00:00:45.470 --> 00:00:47.860
That's probably the
simplest sorting algorithm
00:00:47.860 --> 00:00:51.942
you can write, it's
five lines of code.
00:00:51.942 --> 00:00:53.400
It's not the best
sorting algorithm
00:00:53.400 --> 00:00:57.430
that's out there and so
we'll try and improve it.
00:00:57.430 --> 00:01:00.470
We'll also talk about merge
sort, which is a divide
00:01:00.470 --> 00:01:03.250
and conquer algorithm
and that's going
00:01:03.250 --> 00:01:09.010
to motivate the last thing
that I want to spend time on,
00:01:09.010 --> 00:01:12.300
which is recurrences and
how you solve recurrences.
00:01:12.300 --> 00:01:14.230
Typically the
recurrences that we'll
00:01:14.230 --> 00:01:18.120
be looking at in 6.006
are going to come from divide
00:01:18.120 --> 00:01:20.760
and conquer problems
like merge sort
00:01:20.760 --> 00:01:24.300
but you'll see
this over and over.
00:01:24.300 --> 00:01:26.965
So let's talk about why
we're interested in sorting.
00:01:30.110 --> 00:01:34.370
There's some fairly
obvious applications
00:01:34.370 --> 00:01:37.490
like if you want to
maintain a phone book,
00:01:37.490 --> 00:01:42.010
you've got a bunch of names
and numbers corresponding
00:01:42.010 --> 00:01:44.830
to a telephone
directory and you want
00:01:44.830 --> 00:01:46.330
to keep them in
sorted order so it's
00:01:46.330 --> 00:01:51.510
easy to search, mp3 organizers,
spreadsheets, et cetera.
00:01:51.510 --> 00:01:54.590
So there's lots of
obvious applications.
00:01:54.590 --> 00:01:59.630
There's also some
interesting problems
00:01:59.630 --> 00:02:05.900
that become easy once
items are sorted.
00:02:13.790 --> 00:02:18.170
One example of that
is finding a median.
00:02:23.860 --> 00:02:27.220
So let's say that you
have a bunch of items
00:02:27.220 --> 00:02:36.210
in an array A 0 through
n, and A 0 through n
00:02:36.210 --> 00:02:38.650
contains n numbers and
they're not sorted.
00:02:44.850 --> 00:02:48.950
When you sort, you
turn this into B 0
00:02:48.950 --> 00:02:52.760
through n, where if
it's just numbers, then
00:02:52.760 --> 00:02:55.660
you may sort them in increasing
order or decreasing order.
00:02:55.660 --> 00:02:58.690
Let's just call it
increasing order for now.
00:02:58.690 --> 00:03:01.650
Or if they're records,
and they're not numbers,
00:03:01.650 --> 00:03:04.520
then you have to provide
a comparison function
00:03:04.520 --> 00:03:08.050
to determine which record is
smaller than another record.
00:03:08.050 --> 00:03:09.520
And that's another
input that you
00:03:09.520 --> 00:03:12.980
have to have in order
to do the sorting.
00:03:12.980 --> 00:03:15.310
So it doesn't really
matter what the items are
00:03:15.310 --> 00:03:17.580
as long as you have the
comparison function.
00:03:17.580 --> 00:03:19.930
Think of it as less
than or equal to.
00:03:19.930 --> 00:03:23.750
And if you have that and
it's straightforward,
00:03:23.750 --> 00:03:27.090
obviously, to check that 3
is less than 4, et cetera.
00:03:27.090 --> 00:03:29.640
But it may be a little
more complicated
00:03:29.640 --> 00:03:32.990
for more sophisticated
sorting applications.
00:03:32.990 --> 00:03:36.570
But the bottom line is that if
you have your algorithm that
00:03:36.570 --> 00:03:39.120
takes a comparison
function as an input,
00:03:39.120 --> 00:03:42.670
you're going to be able to,
after a certain amount of time,
00:03:42.670 --> 00:03:45.100
get B 0 through n.
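A minimal Python sketch of sorting records with a caller-supplied comparison function, as described above. The record layout (name, number) and the name `compare_by_name` are hypothetical, chosen only for illustration; `functools.cmp_to_key` is the standard-library adapter that lets `sorted` use such a function.

```python
from functools import cmp_to_key

# Hypothetical phone-book records: (name, number) pairs.
records = [("Carol", "555-0103"), ("Alice", "555-0101"), ("Bob", "555-0102")]

def compare_by_name(a, b):
    """Three-way comparison: negative if a < b, zero if equal, positive if a > b."""
    if a[0] < b[0]:
        return -1
    if a[0] > b[0]:
        return 1
    return 0

# sorted() returns a new list (the B of the lecture) and leaves the input A untouched.
sorted_records = sorted(records, key=cmp_to_key(compare_by_name))
```

The sorting routine never inspects the records itself; everything it learns about their order comes from the comparison function.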
00:03:45.100 --> 00:03:48.680
Now if you wanted to find the
median of the set of numbers
00:03:48.680 --> 00:03:51.720
that were originally
in the array A,
00:03:51.720 --> 00:03:56.090
what would you do once you
have the sorted array B?
00:03:56.090 --> 00:03:59.124
AUDIENCE: Isn't there a more
efficient algorithm for median?
00:03:59.124 --> 00:04:00.040
PROFESSOR: Absolutely.
00:04:00.040 --> 00:04:08.120
But this is sort of a side
effect of having a sorted list.
00:04:08.120 --> 00:04:10.240
If you happen to
have a sorted list,
00:04:10.240 --> 00:04:15.090
there's many ways
that you could imagine
00:04:15.090 --> 00:04:16.310
building up a sorted list.
00:04:16.310 --> 00:04:19.779
One way is you have something
that's completely unsorted
00:04:19.779 --> 00:04:22.620
and you run insertion
sort or merge sort.
00:04:22.620 --> 00:04:25.140
Another way would be to
maintain a sorted list as you're
00:04:25.140 --> 00:04:27.660
getting items put into the list.
00:04:27.660 --> 00:04:29.640
So if you happened
to have a sorted list
00:04:29.640 --> 00:04:32.600
and you need to have this
sorted list for some reason,
00:04:32.600 --> 00:04:35.440
the point I'm making here
is that finding the median
00:04:35.440 --> 00:04:37.140
is easy.
00:04:37.140 --> 00:04:39.430
And it's easy because
all you have to do
00:04:39.430 --> 00:04:43.805
is look at-- depending
on whether n is odd
00:04:43.805 --> 00:04:47.527
or even-- look at B of n over 2.
00:04:47.527 --> 00:04:49.360
That would give you the
median because you'd
00:04:49.360 --> 00:04:54.210
have a bunch of numbers
that are less than that
00:04:54.210 --> 00:04:56.830
and an equal set of numbers
that are greater than that,
00:04:56.830 --> 00:04:59.770
which is the
definition of median.
00:04:59.770 --> 00:05:05.030
So this is not necessarily the
best way, as you pointed out,
00:05:05.030 --> 00:05:06.400
of finding the median.
00:05:06.400 --> 00:05:11.320
But it's constant time if
you have a sorted list.
00:05:11.320 --> 00:05:14.650
That's the point
I wanted to make.
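As a rough sketch of the constant-time lookup, assuming the list is already sorted and non-empty: the function name `median_of_sorted` is made up here, and averaging the two middle elements for even n is one common convention rather than something the lecture specifies.

```python
def median_of_sorted(b):
    """Return the median of an already sorted, non-empty list in constant time."""
    n = len(b)
    mid = n // 2
    if n % 2 == 1:
        return b[mid]                    # odd n: the single middle element
    return (b[mid - 1] + b[mid]) / 2     # even n: average the two middle elements
```

No loop appears anywhere, which is why the cost does not depend on n once the sorting has been paid for.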
00:05:14.650 --> 00:05:16.720
There are other things
that you could do.
00:05:16.720 --> 00:05:20.780
And this came up
in Erik's lecture,
00:05:20.780 --> 00:05:25.570
which is the notion of
binary search-- finding
00:05:25.570 --> 00:05:28.650
an element in an array--
a specific element.
00:05:28.650 --> 00:05:34.090
You have a list of items--
again A 0 through n.
00:05:34.090 --> 00:05:39.600
And you're looking for a
specific number or item.
00:05:43.550 --> 00:05:46.640
You could, obviously,
scan the array,
00:05:46.640 --> 00:05:50.260
and that would take you
linear time to find this item.
00:05:50.260 --> 00:05:53.100
If the array happened
to be sorted,
00:05:53.100 --> 00:05:58.530
then you can find this
in logarithmic time
00:05:58.530 --> 00:06:00.295
using what's called
binary search.
00:06:03.600 --> 00:06:05.880
Let's say you're looking
for a specific item.
00:06:05.880 --> 00:06:08.280
Let's call it k.
00:06:08.280 --> 00:06:11.140
Binary search, roughly
speaking, would
00:06:11.140 --> 00:06:20.200
work like-- you go compare
k to, again, B of n over 2,
00:06:20.200 --> 00:06:23.780
and decide, given
that B is sorted,
00:06:23.780 --> 00:06:28.400
you get to look at
1/2 of the array.
00:06:28.400 --> 00:06:33.080
If B of n over 2 is not
exactly k, then-- well,
00:06:33.080 --> 00:06:34.390
if it's exactly k you're done.
00:06:34.390 --> 00:06:36.770
Otherwise, you look at the
left half or the right half.
00:06:36.770 --> 00:06:39.670
You do your divide
and conquer paradigm.
00:06:39.670 --> 00:06:42.820
And you can do this
in logarithmic time.
00:06:42.820 --> 00:06:45.700
So keep this in mind,
because binary search
00:06:45.700 --> 00:06:48.530
is going to come up
in today's lecture
00:06:48.530 --> 00:06:50.760
and again in other lectures.
00:06:50.760 --> 00:06:53.750
It's really a great
paradigm of divide
00:06:53.750 --> 00:06:56.020
and conquer--
probably the simplest.
00:06:56.020 --> 00:06:57.690
And it, essentially,
takes something
00:06:57.690 --> 00:07:01.040
that's linear--
a linear search--
00:07:01.040 --> 00:07:03.770
and turns it into
logarithmic search.
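The binary search described above can be sketched as follows. This is an illustrative version, assuming a list sorted in increasing order; the name `binary_search` and the choice to return `None` when the item is missing are assumptions, not taken from the course code.

```python
def binary_search(b, k):
    """Return an index of k in the sorted list b, or None if k is absent."""
    lo, hi = 0, len(b)           # the search window is b[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if b[mid] == k:
            return mid           # exactly k: done
        if b[mid] < k:
            lo = mid + 1         # k can only be in the right half
        else:
            hi = mid             # k can only be in the left half
    return None
```

Each pass through the loop halves the window, so the number of comparisons grows with log n rather than n.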
00:07:03.770 --> 00:07:06.540
So those are a
couple of problems
00:07:06.540 --> 00:07:10.950
that become easy if
you have a sorted list.
00:07:10.950 --> 00:07:21.270
And there's some not
so obvious applications
00:07:21.270 --> 00:07:25.150
of sorting-- for example,
data compression.
00:07:25.150 --> 00:07:27.790
If you wanted to
compress a file,
00:07:27.790 --> 00:07:30.530
one of the things that
you could do is to--
00:07:30.530 --> 00:07:35.330
and it's a set of items--
you could sort the items.
00:07:35.330 --> 00:07:37.870
And that automatically
finds duplicates.
00:07:37.870 --> 00:07:42.940
And you could say, if I have 100
items that are all identical,
00:07:42.940 --> 00:07:47.779
I'm going to compress the file
by representing the item once
00:07:47.779 --> 00:07:49.320
and, then, having
a number associated
00:07:49.320 --> 00:07:52.770
with the frequency of that
item-- similar to what
00:07:52.770 --> 00:07:54.440
document distance does.
00:07:54.440 --> 00:07:57.750
Document distance can
be viewed as a way
00:07:57.750 --> 00:07:59.770
of compressing
your initial input.
00:07:59.770 --> 00:08:03.240
Obviously, you lose the works of
Shakespeare or whatever it was.
00:08:03.240 --> 00:08:06.560
And it becomes a bunch
of words and frequencies.
00:08:06.560 --> 00:08:12.870
But it is something that
compresses the input
00:08:12.870 --> 00:08:15.590
and gives you a
different representation.
00:08:15.590 --> 00:08:20.395
And so people use sorting as a
subroutine in data compression.
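One way to picture the sort-then-count idea, as a sketch only: after sorting, equal items sit next to each other, so each run can be stored once with its frequency. The function name `compress` is hypothetical, and, like document distance, this representation discards the original order.

```python
from itertools import groupby

def compress(items):
    """Sort items, then collapse each run of equal items into (item, count)."""
    return [(item, len(list(run))) for item, run in groupby(sorted(items))]
```

`groupby` only groups adjacent equal items, which is exactly why the sort has to come first.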
00:08:23.190 --> 00:08:27.360
Computer graphics uses sorting.
00:08:27.360 --> 00:08:30.560
Most of the time,
when you render
00:08:30.560 --> 00:08:32.870
scenes in computer graphics,
you have many layers
00:08:32.870 --> 00:08:34.559
corresponding to the scenes.
00:08:34.559 --> 00:08:38.550
It turns out that,
in computer graphics,
00:08:38.550 --> 00:08:40.299
most of the time you're
actually rendering
00:08:40.299 --> 00:08:44.410
front to back because,
when you have a big opaque
00:08:44.410 --> 00:08:48.887
object in front, you want
to render that first,
00:08:48.887 --> 00:08:50.970
so you don't have to worry
about everything that's
00:08:50.970 --> 00:08:54.060
occluded by this
big opaque object.
00:08:54.060 --> 00:08:56.590
And that makes things
more efficient.
00:08:56.590 --> 00:08:58.700
And so you keep things
sorted front to back,
00:08:58.700 --> 00:09:01.160
most of the time, in
computer graphics rendering.
00:09:01.160 --> 00:09:03.860
But some of the time, if you're
worried about transparency,
00:09:03.860 --> 00:09:05.660
you have to render
things back to front.
00:09:05.660 --> 00:09:08.390
So typically, you
have sorted lists
00:09:08.390 --> 00:09:11.550
corresponding to the different
objects in both orders--
00:09:11.550 --> 00:09:13.730
both increasing order
and decreasing order.
00:09:13.730 --> 00:09:15.230
And you're maintaining that.
00:09:15.230 --> 00:09:19.190
So sorting is a really
important subroutine
00:09:19.190 --> 00:09:23.090
in pretty much any sophisticated
application you look at.
00:09:23.090 --> 00:09:26.780
So it's worthwhile to look
at the variety of sorting
00:09:26.780 --> 00:09:28.350
algorithms that are out there.
00:09:28.350 --> 00:09:30.432
And we're going to do
some simple ones, today.
00:09:30.432 --> 00:09:31.890
But if you go and
look at Wikipedia
00:09:31.890 --> 00:09:35.270
and do a Google search,
there's all sorts
00:09:35.270 --> 00:09:38.030
of sorts like cocktail
sort, and bitonic sort,
00:09:38.030 --> 00:09:41.900
and what have you.
00:09:41.900 --> 00:09:45.900
And there's reasons why each of
these sorting algorithms exist.
00:09:45.900 --> 00:09:49.830
Because in specific
cases, they end up
00:09:49.830 --> 00:09:53.055
winning on types of inputs
or types of problems.
00:09:55.660 --> 00:09:59.470
So let's take a look at our
first sorting algorithm.
00:09:59.470 --> 00:10:03.640
I'm not going to write code
but it will be in the notes.
00:10:03.640 --> 00:10:08.860
And it is in your document
distance Python files.
00:10:08.860 --> 00:10:10.770
But I'll just give
you pseudocode here
00:10:10.770 --> 00:10:13.750
and walk through what
insertion sort looks like
00:10:13.750 --> 00:10:17.460
because the purpose
of describing
00:10:17.460 --> 00:10:20.756
this algorithm to you is
to analyze its complexity.
00:10:20.756 --> 00:10:22.130
We need to do some
counting here,
00:10:22.130 --> 00:10:25.230
with respect to this
algorithm, to figure out
00:10:25.230 --> 00:10:28.610
how fast it's going to run
and what the worst case
00:10:28.610 --> 00:10:30.280
complexity is.
00:10:30.280 --> 00:10:32.585
So what is insertion sort?
00:10:32.585 --> 00:10:41.780
For i equals 1, 2, through n,
given an input to be sorted,
00:10:41.780 --> 00:10:46.600
what we're going to do is
we're going to insert A of i
00:10:46.600 --> 00:10:48.470
in the right position.
00:10:48.470 --> 00:10:51.170
And we're going
to assume that we
00:10:51.170 --> 00:10:55.220
are sort of midway through
the sorting process, where
00:10:55.220 --> 00:11:00.920
we have sorted A 0
through i minus 1.
00:11:00.920 --> 00:11:04.340
And we're going to
expand this array
00:11:04.340 --> 00:11:07.590
to have i plus 1 elements.
00:11:07.590 --> 00:11:09.650
And A of i is going
to get inserted
00:11:09.650 --> 00:11:12.830
into the correct position.
00:11:12.830 --> 00:11:23.640
And we're going to do
this by pairwise swaps
00:11:23.640 --> 00:11:32.730
down to the correct position
for the number that is initially
00:11:32.730 --> 00:11:33.490
in A of i.
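The pseudocode above might look like this in Python. This is a sketch, not necessarily the version in the course files; with 0-based indexing over n elements the loop runs i from 1 to n minus 1.

```python
def insertion_sort(a):
    """Sort the list a in place, in increasing order, using pairwise swaps."""
    for i in range(1, len(a)):
        # Invariant: a[0:i] is already sorted; a[i] is the key.
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]   # swap the key one position left
            j -= 1
    return a
```

The inner loop stops as soon as the key meets an element that is not larger than it, which is the "correct position" in the description.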
00:11:36.050 --> 00:11:42.410
So let's go through
an example of this.
00:11:42.410 --> 00:11:44.840
We're going to sort
in increasing order.
00:11:44.840 --> 00:11:45.885
Just have six numbers.
00:11:50.430 --> 00:11:54.805
And initially, we
have 5, 2, 4, 6, 1, 3.
00:11:54.805 --> 00:11:56.430
And we're going to
take a look at this.
00:11:56.430 --> 00:12:00.550
And you start with the index
1, or the second element,
00:12:00.550 --> 00:12:03.620
because the very first
element-- it's a single element
00:12:03.620 --> 00:12:06.050
and it's already
sorted by definition.
00:12:06.050 --> 00:12:07.930
But you start from here.
00:12:07.930 --> 00:12:10.890
And this is what
we call our key.
00:12:10.890 --> 00:12:15.250
And that's essentially a pointer
to where we're at, right now.
00:12:15.250 --> 00:12:17.020
And the key keeps
moving to the right
00:12:17.020 --> 00:12:20.007
as we go through the different
steps of the algorithm.
00:12:20.007 --> 00:12:21.590
And so what you do
is you look at this
00:12:21.590 --> 00:12:24.830
and you have-- this is A of i.
00:12:24.830 --> 00:12:26.030
That's your key.
00:12:26.030 --> 00:12:30.070
And you have A of
0 to 0, which is 5.
00:12:30.070 --> 00:12:34.260
And since we want to
sort in increasing order,
00:12:34.260 --> 00:12:35.940
this is not sorted.
00:12:35.940 --> 00:12:37.720
And so we do a swap.
00:12:37.720 --> 00:12:42.400
So what this would do in
this step is to do a swap.
00:12:42.400 --> 00:12:51.830
And we would go obtain
2, 5, 4, 6, 1, 3.
00:12:51.830 --> 00:12:55.080
So all that's happened here,
in this step-- in the very
00:12:55.080 --> 00:12:57.360
first step where the key
is in the second position--
00:12:57.360 --> 00:13:00.020
is one swap happened.
00:13:00.020 --> 00:13:03.340
Now, your key is
here, at item 4.
00:13:03.340 --> 00:13:05.980
Again, you need to put
4 into the right spot.
00:13:05.980 --> 00:13:08.670
And so you do pairwise swaps.
00:13:08.670 --> 00:13:11.280
And in this case, you
have to do one swap.
00:13:11.280 --> 00:13:12.750
And you get 2, 4, 5.
00:13:12.750 --> 00:13:15.650
And you're done
with this iteration.
00:13:15.650 --> 00:13:27.850
So what happens here is
you have 2, 4, 5, 6, 1, 3.
00:13:27.850 --> 00:13:33.010
And now, the key
is over here, at 6.
00:13:33.010 --> 00:13:37.860
Now, at this point,
things are kind of easy,
00:13:37.860 --> 00:13:41.180
in the sense that you look
at it and you say, well, I
00:13:41.180 --> 00:13:43.480
know this part is
already sorted.
00:13:43.480 --> 00:13:44.970
6 is greater than 5.
00:13:44.970 --> 00:13:47.000
So you have to do nothing.
00:13:47.000 --> 00:13:51.530
So there's no swaps that
happen in this step.
00:13:51.530 --> 00:13:56.440
So all that happens
here is you're
00:13:56.440 --> 00:14:02.280
going to move the key to
one step to the right.
00:14:02.280 --> 00:14:06.370
So you have 2, 4, 5, 6, 1, 3.
00:14:06.370 --> 00:14:10.270
And your key is now at 1.
00:14:10.270 --> 00:14:11.910
Here, you have to do more work.
00:14:11.910 --> 00:14:16.770
Now, you see one aspect of the
complexity of this algorithm--
00:14:16.770 --> 00:14:19.470
given that you're doing
pairwise swaps-- the way
00:14:19.470 --> 00:14:23.420
this algorithm was defined, in
pseudocode, out there, was I'm
00:14:23.420 --> 00:14:27.760
going to use pairwise swaps
to find the correct position.
00:14:27.760 --> 00:14:29.640
So what you're going
to do is you're
00:14:29.640 --> 00:14:34.080
going to have to
swap first 1 and 6.
00:14:34.080 --> 00:14:36.310
And then you'll
swap-- 1 is over here.
00:14:36.310 --> 00:14:39.970
So you'll swap this
position and that position.
00:14:39.970 --> 00:14:44.580
And then you'll
swap-- essentially,
00:14:44.580 --> 00:14:49.910
do 4 swaps to get to
the point where you have
00:14:49.910 --> 00:14:52.970
1, 2, 4, 5, 6, 3.
00:14:52.970 --> 00:14:56.650
So this is the result.
00:14:59.190 --> 00:15:03.770
1, 2, 4, 5, 6, 3.
00:15:03.770 --> 00:15:06.360
And the important thing
to understand, here,
00:15:06.360 --> 00:15:09.050
is that you've done
four swaps to get 1
00:15:09.050 --> 00:15:10.160
to the correct position.
00:15:10.160 --> 00:15:12.480
Now, you could imagine a
different data structure
00:15:12.480 --> 00:15:15.470
where you move this over
there and you shift them
00:15:15.470 --> 00:15:16.930
all to the right.
00:15:16.930 --> 00:15:20.230
But in fact, that shifting
of these four elements
00:15:20.230 --> 00:15:23.630
is going to be computed
in our model as four
00:15:23.630 --> 00:15:26.244
operations, or
four steps, anyway.
00:15:26.244 --> 00:15:27.910
So there's no getting
away from the fact
00:15:27.910 --> 00:15:30.660
that you have to do
four things here.
00:15:30.660 --> 00:15:36.830
And the way the code that
we have for insertion sort
00:15:36.830 --> 00:15:39.400
does this is by
using pairwise swaps.
00:15:39.400 --> 00:15:41.470
So we're almost done.
00:15:41.470 --> 00:15:49.490
Now, we have the key at 3.
00:15:49.490 --> 00:15:52.910
And now, 3 needs to get put
into the correct position.
00:15:52.910 --> 00:15:55.350
And so you've got
to do a few swaps.
00:15:55.350 --> 00:15:58.320
This is the last step.
00:15:58.320 --> 00:16:03.580
And what happens here is 3 is
going to get swapped with 6.
00:16:03.580 --> 00:16:06.520
And then 3 needs to
get swapped with 5.
00:16:06.520 --> 00:16:09.770
And then 3 needs to
get swapped with 4.
00:16:09.770 --> 00:16:12.985
And then, since 3 is
greater than 2, you're done.
00:16:12.985 --> 00:16:16.325
So you have 1, 2, 3, 4, 5, 6.
00:16:18.880 --> 00:16:21.180
And that's it.
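The walkthrough above can be replayed in code to check the swap counts. The helper name `insertion_sort_trace` is made up for this sketch; it records how many pairwise swaps each key needs.

```python
def insertion_sort_trace(a):
    """Return (sorted copy, swaps per key position) for insertion sort with pairwise swaps."""
    a = list(a)
    swaps_per_key = []
    for i in range(1, len(a)):
        j, swaps = i, 0
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
            swaps += 1
        swaps_per_key.append(swaps)
    return a, swaps_per_key

result, swaps = insertion_sort_trace([5, 2, 4, 6, 1, 3])
# swaps holds one count per key, for the keys 2, 4, 6, 1, 3 in that order:
# [1, 1, 0, 4, 3]
```

The counts match the board example: one swap for 2, one for 4, none for 6, four for 1, and three for 3.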
00:16:21.180 --> 00:16:22.820
So, analysis.
00:16:25.380 --> 00:16:26.630
How many steps do I have?
00:16:30.670 --> 00:16:32.150
AUDIENCE: n squared?
00:16:32.150 --> 00:16:36.310
PROFESSOR: No, how
many steps do I have?
00:16:36.310 --> 00:16:40.120
I guess that wasn't
a good question.
00:16:40.120 --> 00:16:43.930
If I think of a step as
being a movement of the key,
00:16:43.930 --> 00:16:46.215
how many steps do I have?
00:16:46.215 --> 00:16:49.930
I have theta n steps.
00:16:49.930 --> 00:16:56.570
And in this case, you can
think of it as n minus 1 steps,
00:16:56.570 --> 00:16:58.030
since you started with 2.
00:16:58.030 --> 00:17:03.900
But let's just call
it theta n steps,
00:17:03.900 --> 00:17:06.780
in terms of key positions.
00:17:10.060 --> 00:17:11.150
And you're right.
00:17:11.150 --> 00:17:15.349
It is n squared because,
at any given step,
00:17:15.349 --> 00:17:19.730
it's quite possible that
I have to do theta n work.
00:17:19.730 --> 00:17:22.400
And one example is
this one, right here,
00:17:22.400 --> 00:17:25.160
where I had to do four swaps.
00:17:25.160 --> 00:17:27.599
And in general, you can
construct a scenario
00:17:27.599 --> 00:17:31.470
where, towards the
end of the algorithm,
00:17:31.470 --> 00:17:34.120
you'd have to do theta n work.
00:17:34.120 --> 00:17:37.560
In fact, if you had a list
that was reverse sorted,
00:17:37.560 --> 00:17:40.960
you would, essentially,
have to do, on average, n
00:17:40.960 --> 00:17:43.850
over 2 swaps as you go
through each of the steps.
00:17:43.850 --> 00:17:45.300
And that's theta n.
00:17:45.300 --> 00:17:52.150
So each step is theta n swaps.
00:17:55.930 --> 00:17:58.740
And when I say
swaps, I could also
00:17:58.740 --> 00:18:06.645
say each step is theta
n compares and swaps.
00:18:06.645 --> 00:18:08.020
And this is going
to be important
00:18:08.020 --> 00:18:10.430
because I'm going to ask
you an interesting question
00:18:10.430 --> 00:18:11.700
in a minute.
00:18:11.700 --> 00:18:13.840
But let me summarize.
00:18:13.840 --> 00:18:16.470
What I have here is a
theta n squared algorithm.
00:18:16.470 --> 00:18:17.970
The reason this is
a theta n squared
00:18:17.970 --> 00:18:22.760
algorithm is because
I have theta n steps
00:18:22.760 --> 00:18:26.860
and each step is theta n.
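A small experiment can make the theta n squared count concrete. On a reverse-sorted input, the key at index i has to move past all i earlier elements, so the total is 1 + 2 + ... + (n - 1) = n(n - 1)/2 swaps, or about n/2 per step on average. The helper name `count_swaps` and the choice of n = 100 are arbitrary.

```python
def count_swaps(a):
    """Count the pairwise swaps insertion sort performs on a copy of a."""
    a = list(a)
    swaps = 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
            swaps += 1
    return swaps

n = 100
worst = count_swaps(range(n, 0, -1))   # reverse-sorted input: n(n-1)/2 swaps
best = count_swaps(range(n))           # already sorted input: no swaps at all
```

The gap between `best` and `worst` shows that the quadratic bound is a worst-case statement, not a cost paid on every input.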
00:18:26.860 --> 00:18:29.140
When I'm counting,
what am I counting
00:18:29.140 --> 00:18:30.730
in terms of operations?
00:18:30.730 --> 00:18:33.510
The assumption here--
unspoken assumption--
00:18:33.510 --> 00:18:36.810
has been that an operation
is a compare and a swap
00:18:36.810 --> 00:18:39.540
and they're, essentially,
equal in cost.
00:18:39.540 --> 00:18:41.850
And in most computers,
that's true.
00:18:41.850 --> 00:18:45.210
You have a single
instruction in, say, the x86
00:18:45.210 --> 00:18:47.700
or the MIPS architecture
that can do a compare,
00:18:47.700 --> 00:18:50.660
and the same thing for
swapping registers.
00:18:50.660 --> 00:18:52.640
So it's a perfectly
reasonable assumption
00:18:52.640 --> 00:18:56.480
that compares and
swaps for numbers
00:18:56.480 --> 00:18:58.410
have exactly the same cost.
00:18:58.410 --> 00:19:01.900
But if you had a record and
you were comparing records,
00:19:01.900 --> 00:19:05.700
and the comparison function that
you used for the records was
00:19:05.700 --> 00:19:08.820
in itself a method
call or a subroutine,
00:19:08.820 --> 00:19:11.290
it's quite possible
that all you're doing
00:19:11.290 --> 00:19:15.600
is swapping pointers or
references to do the swap,
00:19:15.600 --> 00:19:17.985
but the comparison could be
substantially more expensive.
00:19:22.870 --> 00:19:24.920
Most of the time-- and
we'll differentiate
00:19:24.920 --> 00:19:27.150
if it becomes
necessary-- we're going
00:19:27.150 --> 00:19:29.560
to be counting comparisons
in the sorting algorithms
00:19:29.560 --> 00:19:31.230
that we'll be putting out.
00:19:31.230 --> 00:19:36.130
And we'll be assuming that
either compares and swaps are
00:19:36.130 --> 00:19:41.040
roughly the same or
that compares are--
00:19:41.040 --> 00:19:44.570
and we'll say which one,
of course-- that compares
00:19:44.570 --> 00:19:47.830
are substantially more
expensive than swaps.
00:19:47.830 --> 00:19:52.270
So if you had either of those
cases for insertion sort,
00:19:52.270 --> 00:19:54.226
you have a theta n
squared algorithm.
00:19:54.226 --> 00:19:55.600
You have theta n
squared compares
00:19:55.600 --> 00:19:58.200
and theta n squared swaps.
00:19:58.200 --> 00:20:00.780
Now, here's a question.
00:20:00.780 --> 00:20:11.179
Let's say that compares are
more expensive than swaps.
00:20:11.179 --> 00:20:12.720
And so, I'm concerned
about the theta
00:20:12.720 --> 00:20:14.750
n squared comparison cost.
00:20:17.270 --> 00:20:20.880
I'm not as concerned, because of
the constant factors involved,
00:20:20.880 --> 00:20:22.710
with the theta n
squared swap cost.
00:20:25.410 --> 00:20:28.730
This is a cushion question.
00:20:28.730 --> 00:20:33.590
What's a simple fix-- change
to this algorithm that
00:20:33.590 --> 00:20:37.260
would give me a better
complexity in the case
00:20:37.260 --> 00:20:39.900
where compares are
more expensive,
00:20:39.900 --> 00:20:43.300
or I'm only looking at the
complexity of compares.
00:20:43.300 --> 00:20:46.990
So the theta
whatever of compares.
00:20:46.990 --> 00:20:47.950
Anyone?
00:20:47.950 --> 00:20:48.661
Yeah, back there.
00:20:48.661 --> 00:20:49.536
AUDIENCE: [INAUDIBLE]
00:20:56.356 --> 00:20:58.230
PROFESSOR: You could
compare with the middle.
00:20:58.230 --> 00:20:59.021
What did I call it?
00:21:01.910 --> 00:21:03.120
I called it something.
00:21:03.120 --> 00:21:06.161
What you just said, I
called it something.
00:21:06.161 --> 00:21:07.160
AUDIENCE: Binary search.
00:21:07.160 --> 00:21:07.740
PROFESSOR: Binary search.
00:21:07.740 --> 00:21:08.310
That's right.
00:21:08.310 --> 00:21:10.280
Two cushions for this one.
00:21:10.280 --> 00:21:12.221
So you pick them
up after lecture.
00:21:12.221 --> 00:21:13.220
So you're exactly right.
00:21:13.220 --> 00:21:13.928
You got it right.
00:21:13.928 --> 00:21:18.160
I called it binary
search, up here.
00:21:18.160 --> 00:21:21.620
And so you can
take insertion sort
00:21:21.620 --> 00:21:24.800
and you can sort of trivially
turn it into a theta n log n
00:21:24.800 --> 00:21:27.200
algorithm if what we
are counting is
00:21:27.200 --> 00:21:29.910
the number of compares.
00:21:29.910 --> 00:21:32.425
And all you have to do
to do that is to say,
00:21:32.425 --> 00:21:34.280
you know what, I'm
going to replace
00:21:34.280 --> 00:21:37.950
this with binary search.
00:21:37.950 --> 00:21:42.720
And you can do that-- and
that was the key observation--
00:21:42.720 --> 00:21:47.990
because A of 0 through i
minus 1 is already sorted.
00:21:47.990 --> 00:21:51.909
And so you can do binary search
on that part of the array.
00:21:51.909 --> 00:21:53.200
So let me just write that down.
00:21:56.750 --> 00:22:04.000
Do a binary search on A
of 0 through i minus 1,
00:22:04.000 --> 00:22:05.095
which is already sorted.
00:22:10.540 --> 00:22:16.780
And essentially, you can think
of it as theta log i time,
00:22:16.780 --> 00:22:18.070
and for each of those steps.
00:22:18.070 --> 00:22:27.251
And so then you get your
theta n log
00:22:27.251 --> 00:22:30.410
n in terms of compares.
00:22:30.410 --> 00:22:37.940
Does this help the swaps
for an array data structure?
00:22:37.940 --> 00:22:41.280
No, because binary search
will require insertion
00:22:41.280 --> 00:22:44.670
into A of 0 through i minus 1.
00:22:44.670 --> 00:22:45.880
So here's the problem.
00:22:45.880 --> 00:22:50.430
Why don't we have a full-fledged
theta n log n algorithm,
00:22:50.430 --> 00:22:53.940
regardless of the cost
of compares or swaps?
00:22:53.940 --> 00:22:55.470
We don't quite have that.
00:22:55.470 --> 00:23:02.950
We don't quite have that because
we need to insert our A of i
00:23:02.950 --> 00:23:07.850
into the right position into
A of 0 through i minus 1.
00:23:07.850 --> 00:23:09.790
If you do that with
an array structure,
00:23:09.790 --> 00:23:10.998
the key might go into the middle.
00:23:10.998 --> 00:23:13.337
And you have to shift
things over to the right.
00:23:13.337 --> 00:23:15.170
And when you shift
things over to the right,
00:23:15.170 --> 00:23:17.090
in the worst case, you may
be shifting a lot of things
00:23:17.090 --> 00:23:17.980
over to the right.
00:23:17.980 --> 00:23:20.630
And that gets back to worst
case complexity of theta n.
00:23:23.200 --> 00:23:27.000
So a binary search
in insertion sort
00:23:27.000 --> 00:23:29.197
gives you theta n
log n for compares.
00:23:29.197 --> 00:23:30.905
But it's still theta
n squared for swaps.
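Here is a hedged sketch of binary insertion sort that counts both kinds of work separately. The function name and the returned counters are assumptions made for illustration; the point is that binary search shrinks the compares, while the elements to the right of the insertion point still have to move.

```python
def binary_insertion_sort(a):
    """Sort a copy of a; return (sorted list, number of compares, number of shifts)."""
    a = list(a)
    compares = shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        lo, hi = 0, i                  # a[0:i] is already sorted
        while lo < hi:                 # binary search: about log2(i) compares
            mid = (lo + hi) // 2
            compares += 1
            if a[mid] <= key:
                lo = mid + 1
            else:
                hi = mid
        # Insertion point found, but elements a[lo:i] must still move right.
        for j in range(i, lo, -1):
            a[j] = a[j - 1]
            shifts += 1
        a[lo] = key
    return a, compares, shifts
```

On a reverse-sorted input of n elements this performs roughly n log n compares but still n(n - 1)/2 shifts, which is the mismatch the lecture points out.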
00:23:35.000 --> 00:23:36.805
So as you can see,
there's many varieties
00:23:36.805 --> 00:23:37.770
of sorting algorithms.
00:23:37.770 --> 00:23:39.850
We just looked at
a couple of them.
00:23:39.850 --> 00:23:43.010
And they were both
insertion sort.
00:23:43.010 --> 00:23:45.040
The second one
that I just put up
00:23:45.040 --> 00:23:48.900
is, I guess, technically
called binary insertion sort
00:23:48.900 --> 00:23:50.710
because it does binary search.
00:23:50.710 --> 00:23:53.000
And the vanilla
insertion sort is
00:23:53.000 --> 00:23:56.676
the one that you have the code
for in the docdist program,
00:23:56.676 --> 00:23:59.400
or at least one of
the docdist files.
00:23:59.400 --> 00:24:04.620
So let's move on and talk
about a different algorithm.
00:24:04.620 --> 00:24:06.830
So what we'd like to
do, now-- this class
00:24:06.830 --> 00:24:09.120
is about constant improvement.
00:24:09.120 --> 00:24:11.480
We're never happy.
00:24:11.480 --> 00:24:14.370
We always want to do
a little bit better.
00:24:14.370 --> 00:24:16.864
And eventually, once
we run out of room
00:24:16.864 --> 00:24:18.280
from an asymptotic
standpoint, you
00:24:18.280 --> 00:24:20.363
take these other classes
where you try and improve
00:24:20.363 --> 00:24:24.380
constant factors and
get 10%, and 5%, and 1%,
00:24:24.380 --> 00:24:25.560
and so on, and so forth.
00:24:25.560 --> 00:24:31.200
But we'll stick to improving
asymptotic complexity.
00:24:31.200 --> 00:24:34.190
And we're not quite happy
with binary insertion sort
00:24:34.190 --> 00:24:37.050
because, in the case of numbers,
our binary insertion sort
00:24:37.050 --> 00:24:40.709
has theta n squared complexity,
if you look at swaps.
00:24:40.709 --> 00:24:43.042
So we'd like to go find an
algorithm that is theta n log
00:24:43.042 --> 00:24:44.810
n.
00:24:44.810 --> 00:24:49.600
And I guess, eventually,
we'll have to stop.
00:24:49.600 --> 00:24:51.260
But Erik will take care of that.
00:24:53.900 --> 00:24:54.970
There's a reason to stop.
00:24:54.970 --> 00:24:58.620
It's when you can prove that
you can't do any better.
00:24:58.620 --> 00:25:01.210
And so we'll get to
that, eventually.
00:25:01.210 --> 00:25:04.685
So merge sort is also something
that you've probably seen.
00:25:07.277 --> 00:25:08.735
But there probably
will be a couple
00:25:08.735 --> 00:25:12.440
of subtleties that come out as
I describe this algorithm that,
00:25:12.440 --> 00:25:15.340
hopefully, will be interesting
to those of you who already
00:25:15.340 --> 00:25:16.810
know merge sort.
00:25:16.810 --> 00:25:21.030
And for those of you who don't,
it's a very pretty algorithm.
00:25:21.030 --> 00:25:26.930
It's a standard recursion
algorithm-- recursive
00:25:26.930 --> 00:25:30.620
algorithm-- similar
to a binary search.
00:25:30.620 --> 00:25:34.780
What we do, here, is we have
an array, A. We split it
00:25:34.780 --> 00:25:42.095
into two parts, L and R.
And essentially, we kind of
00:25:42.095 --> 00:25:43.950
do no work, really.
00:25:43.950 --> 00:25:49.814
In terms of the L and R in
the sense that we just call,
00:25:49.814 --> 00:25:51.480
we keep splitting,
splitting, splitting.
00:25:51.480 --> 00:25:54.020
And all the work is
done down at the bottom
00:25:54.020 --> 00:25:57.570
in this routine called
merge, where we are merging
00:25:57.570 --> 00:26:00.110
a pair of elements
at the leaves.
00:26:00.110 --> 00:26:04.490
And then, we merge two
pairs and get four elements.
00:26:04.490 --> 00:26:08.630
And then we merge 4-tuples
of elements, et cetera,
00:26:08.630 --> 00:26:10.080
and go all the way up.
00:26:10.080 --> 00:26:18.990
So while I'm just saying L
turns into L prime, out here,
00:26:18.990 --> 00:26:20.990
there's no real
explicit code that you
00:26:20.990 --> 00:26:23.870
can see that turns
L into L prime.
00:26:23.870 --> 00:26:25.630
It happens really later.
00:26:25.630 --> 00:26:27.190
There's no real
sorting code, here.
00:26:27.190 --> 00:26:28.790
It happens in the merge routine.
00:26:28.790 --> 00:26:30.649
And you'll see
that quite clearly
00:26:30.649 --> 00:26:31.940
when we run through an example.
00:26:34.960 --> 00:26:41.500
So you have L and R turn
into L prime and R prime.
00:26:41.500 --> 00:26:52.310
And what we end up getting
is a sorted array, A.
00:26:52.310 --> 00:26:58.900
And we have what's called
a merge routine that
00:26:58.900 --> 00:27:01.110
takes L prime and R
prime and merges them
00:27:01.110 --> 00:27:02.400
into the sorted array.
00:27:02.400 --> 00:27:09.270
So at the top level, what
you see is split into two,
00:27:09.270 --> 00:27:13.280
and do a merge, and get
to the sorted array.
00:27:13.280 --> 00:27:16.680
The input is of size n.
00:27:16.680 --> 00:27:24.690
You have two arrays
of size n over 2.
00:27:24.690 --> 00:27:33.210
These are two sorted
arrays of size n over 2.
00:27:33.210 --> 00:27:39.480
And then, finally, you have
a sorted array of size n.
00:27:42.116 --> 00:27:44.240
So if you want to follow
the recursive execution
00:27:44.240 --> 00:27:49.870
of this in a small
example, then you'll
00:27:49.870 --> 00:27:53.790
be able to see how this works.
00:27:53.790 --> 00:27:56.120
And we'll do a fairly
straightforward example
00:27:56.120 --> 00:27:58.200
with 8 elements.
00:27:58.200 --> 00:28:03.180
So at the top level--
before we get there, merge
00:28:03.180 --> 00:28:08.640
is going to assume that
you have two sorted arrays,
00:28:08.640 --> 00:28:11.700
and merge them together.
00:28:11.700 --> 00:28:15.960
That's the invariant in merge
sort, or for the merge routine.
00:28:15.960 --> 00:28:19.570
It assumes the inputs are
sorted-- L and R. Actually
00:28:19.570 --> 00:28:22.800
I should say, L
prime and R prime.
00:28:22.800 --> 00:28:27.624
So let's say you have
20, 13, 7, and 2.
00:28:27.624 --> 00:28:31.320
You have 12, 11, 9, and 1.
00:28:31.320 --> 00:28:33.400
And this could be L prime.
00:28:33.400 --> 00:28:36.840
And this could be R prime.
00:28:36.840 --> 00:28:39.650
What you have is what we
call a two finger algorithm.
00:28:39.650 --> 00:28:42.380
And so you've got two
fingers and each of them
00:28:42.380 --> 00:28:44.162
points to something.
00:28:44.162 --> 00:28:45.870
And in this case, one
of them is pointing
00:28:45.870 --> 00:28:49.190
to L. My left finger
is pointing to L prime,
00:28:49.190 --> 00:28:50.800
or some element in L prime.
00:28:50.800 --> 00:28:53.850
My right finger is pointing
to some element in R prime.
00:28:53.850 --> 00:28:56.820
And I'm going to
compare the two elements
00:28:56.820 --> 00:28:58.740
that my fingers are pointing to.
00:28:58.740 --> 00:29:02.170
And I'm going to
choose, in this case,
00:29:02.170 --> 00:29:03.840
the smaller of those elements.
00:29:03.840 --> 00:29:07.790
And I'm going to put them
into the sorted array.
00:29:07.790 --> 00:29:10.970
So start out here.
00:29:10.970 --> 00:29:12.480
Look at that and that.
00:29:12.480 --> 00:29:14.266
And I compared 2 and 1.
00:29:14.266 --> 00:29:15.140
And which is smaller?
00:29:15.140 --> 00:29:16.310
1 is smaller.
00:29:16.310 --> 00:29:19.130
So I'm going to write 1 down.
00:29:19.130 --> 00:29:23.720
This is a two finger
algo for merge.
00:29:23.720 --> 00:29:24.880
And I put 1 down.
00:29:24.880 --> 00:29:27.380
When I put 1 down, I
had to cross out 1.
00:29:27.380 --> 00:29:29.395
So effectively, what
happens is-- let
00:29:29.395 --> 00:29:31.460
me just circle that
instead of crossing it out.
00:29:31.460 --> 00:29:35.450
And my finger moves up to 9.
00:29:35.450 --> 00:29:38.110
So now I'm pointing at 2 and 9.
00:29:38.110 --> 00:29:40.080
And I repeat this step.
00:29:40.080 --> 00:29:41.870
So now, in this
case, 2 is smaller.
00:29:41.870 --> 00:29:44.040
So I'm going to go
ahead and write 2 down.
00:29:44.040 --> 00:29:49.420
And I can cross out 2 and
move my finger up to 7.
00:29:49.420 --> 00:29:50.840
And so that's it.
00:29:50.840 --> 00:29:54.010
I won't bore you with
the rest of the steps.
00:29:54.010 --> 00:29:56.114
It's essentially walking up.
00:29:56.114 --> 00:29:57.780
You have a couple of
pointers and you're
00:29:57.780 --> 00:29:59.920
walking up these two arrays.
00:29:59.920 --> 00:30:07.230
And you're writing down 1,
2, 7, 9, 11, 12, 13, 20.
00:30:07.230 --> 00:30:08.730
And that's your merge routine.
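NOTE A minimal Python sketch of the two-finger merge just described. The function name `merge` and the index names `i` and `j` (the two "fingers") are illustrative choices, not code from the lecture, and the lists are written smallest-first rather than bottom-up as on the board.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0  # the two "fingers": i walks left, j walks right
    while i < len(left) and j < len(right):
        # Compare the two pointed-to elements and take the smaller one.
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One input is exhausted; the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# The lecture's L prime and R prime, written in increasing order.
print(merge([2, 7, 13, 20], [1, 9, 11, 12]))
```

NOTE With those inputs the sketch writes down 1, 2, 7, 9, 11, 12, 13, 20. Each element is appended exactly once, which is where the linear cost of merge comes from.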
00:30:08.730 --> 00:30:12.330
And all of the work, really,
is done in the merge routine
00:30:12.330 --> 00:30:15.460
because, other than
that, the body is simply
00:30:15.460 --> 00:30:16.620
a recursive call.
00:30:16.620 --> 00:30:18.420
You have to, obviously,
split the array.
00:30:18.420 --> 00:30:20.110
But that's fairly
straightforward.
00:30:20.110 --> 00:30:24.600
If you have an array, A 0
through n-- and depending on
00:30:24.600 --> 00:30:28.300
whether n is odd
or even-- you could
00:30:28.300 --> 00:30:38.530
imagine that you set L
to be A[0 .. n/2 - 1],
00:30:38.530 --> 00:30:41.420
and R similarly.
00:30:41.420 --> 00:30:44.086
And so you just split it
halfway in the middle.
00:30:44.086 --> 00:30:45.710
I'll talk about that
a little bit more.
00:30:45.710 --> 00:30:47.334
There's a subtlety
associated with that
00:30:47.334 --> 00:30:51.200
that we'll get to
in a few minutes.
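NOTE A sketch of the top-level structure, assuming Python lists and slicing for the split. The names `merge_sort`, `left_sorted`, and `right_sorted` are hypothetical (they stand in for L prime and R prime), and this version returns a new list rather than sorting in place.

```python
def merge(left, right):
    """Two-finger merge of two sorted lists."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def merge_sort(a):
    """Return a sorted copy of a."""
    if len(a) <= 1:
        return list(a)  # zero or one element is already sorted
    mid = len(a) // 2   # integer division handles odd and even n
    left_sorted = merge_sort(a[:mid])   # L turns into L prime
    right_sorted = merge_sort(a[mid:])  # R turns into R prime
    return merge(left_sorted, right_sorted)


# An arbitrary 8-element input, chosen only for illustration.
print(merge_sort([20, 1, 13, 9, 7, 12, 2, 11]))
```

NOTE The body of `merge_sort` contains no comparisons at all. It only splits and recurses, and the ordering work happens inside `merge`.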
00:30:51.200 --> 00:30:55.280
But to finish up in terms of
the computation of merge sort.
00:30:55.280 --> 00:30:56.110
This is it.
00:30:56.110 --> 00:31:00.827
The merge routine is doing
most, if not all, of the work.
00:31:00.827 --> 00:31:02.410
And this two finger
algorithm is going
00:31:02.410 --> 00:31:04.630
to be able to take
two sorted arrays
00:31:04.630 --> 00:31:09.550
and put them into a
single sorted array
00:31:09.550 --> 00:31:13.150
by interspersing, or
interleaving, these elements.
00:31:13.150 --> 00:31:15.000
And what's the
complexity of merge
00:31:15.000 --> 00:31:18.710
if I have two arrays
of size n over 2, here?
00:31:18.710 --> 00:31:21.810
What do I have?
00:31:21.810 --> 00:31:22.590
AUDIENCE: n.
00:31:22.590 --> 00:31:23.731
PROFESSOR: n.
00:31:23.731 --> 00:31:24.980
We'll give you a cushion, too.
00:31:28.050 --> 00:31:29.165
theta n complexity.
00:31:35.470 --> 00:31:36.290
So far so good.
00:31:38.830 --> 00:31:41.640
I know you know the
answer as to what
00:31:41.640 --> 00:31:43.550
the complexity of merge sort is.
00:31:43.550 --> 00:31:45.180
But I'm guessing
that most of you
00:31:45.180 --> 00:31:47.900
won't be able to prove it to me
because I'm kind of a hard guy
00:31:47.900 --> 00:31:50.920
to prove something to.
00:31:50.920 --> 00:31:53.040
And I could always say,
no, I don't believe you
00:31:53.040 --> 00:31:53.956
or I don't understand.
00:31:57.960 --> 00:32:00.880
The complexity-- and you've
said this before, in class,
00:32:00.880 --> 00:32:02.580
and I think Erik's
mentioned it--
00:32:02.580 --> 00:32:08.370
the overall complexity of this
algorithm is theta n log n.
00:32:08.370 --> 00:32:09.810
And where does that come from?
00:32:09.810 --> 00:32:11.790
How do you prove that?
00:32:11.790 --> 00:32:16.840
And so what we'll do, now,
is take a look at merge sort.
00:32:16.840 --> 00:32:19.070
And we'll look at
the recursion tree.
00:32:19.070 --> 00:32:20.695
And we'll try and--
there are many ways
00:32:20.695 --> 00:32:23.370
of proving that merge
sort is theta n log n.
00:32:23.370 --> 00:32:25.860
The way we're
going to do this is
00:32:25.860 --> 00:32:28.640
what's called proof by picture.
00:32:28.640 --> 00:32:32.290
And it's not an established
proof technique,
00:32:32.290 --> 00:32:35.020
but it's something
that is very helpful
00:32:35.020 --> 00:32:38.100
to get an intuition
behind the proof
00:32:38.100 --> 00:32:40.441
and why the result is true.
00:32:40.441 --> 00:32:41.940
And you can always
take that and you
00:32:41.940 --> 00:32:47.030
can formalize it and
make this something
00:32:47.030 --> 00:32:49.680
that everyone believes.
00:32:49.680 --> 00:32:52.960
And we'll also look at
substitution, possibly
00:32:52.960 --> 00:32:56.310
in section tomorrow,
for recurrence solving.
00:32:56.310 --> 00:33:00.540
So where we are right now is that
we have a divide and conquer
00:33:00.540 --> 00:33:07.710
algorithm that has a merge
step that is theta n.
00:33:07.710 --> 00:33:12.540
And so, if I just look at this
structure that I have here,
00:33:12.540 --> 00:33:16.150
I can write a recurrence
for merge sort
00:33:16.150 --> 00:33:17.966
that looks like this.
00:33:17.966 --> 00:33:22.720
So when I say
complexity, I can say
00:33:22.720 --> 00:33:26.230
T of n, which is the
work done for n items,
00:33:26.230 --> 00:33:28.910
is going to be some
constant time in order
00:33:28.910 --> 00:33:31.940
to divide the array.
00:33:31.940 --> 00:33:34.200
So this could be the
part corresponding
00:33:34.200 --> 00:33:36.360
to dividing the array.
00:33:36.360 --> 00:33:40.360
And there's going to be two
problems of size n over 2.
00:33:40.360 --> 00:33:42.810
And so I have 2 T of n over 2.
00:33:42.810 --> 00:33:44.710
And this is the recursive part.
00:33:48.650 --> 00:33:53.960
And I'm going to have c times
n, which is the merge part.
00:33:53.960 --> 00:33:58.910
And that's some constant times
n, which is what we have,
00:33:58.910 --> 00:34:01.890
here, with respect to
the theta n complexity.
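NOTE Written out, the recurrence described here would look roughly as follows. The symbol c_1 for the constant divide cost is an assumption that matches the "c1" mentioned later in the lecture, and c is the constant hidden in the theta n merge cost.

```latex
T(n) \;=\; \underbrace{c_1}_{\text{divide}}
      \;+\; \underbrace{2\,T(n/2)}_{\text{recursion}}
      \;+\; \underbrace{c \cdot n}_{\text{merge}}
```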
00:34:01.890 --> 00:34:04.980
So you have a recurrence like
this and I know some of you
00:34:04.980 --> 00:34:07.150
have seen recurrences in 6.042.
00:34:07.150 --> 00:34:09.239
And you know how to solve this.
00:34:09.239 --> 00:34:14.469
What I'd like to do is show you
this recursion tree expansion
00:34:14.469 --> 00:34:17.989
that, not only tells you how
to solve this recurrence,
00:34:17.989 --> 00:34:23.102
but also gives you a means
of solving recurrences where,
00:34:23.102 --> 00:34:25.560
instead of having cn, you
have something else out here.
00:34:25.560 --> 00:34:27.790
You have f of n, which
is a different function
00:34:27.790 --> 00:34:29.280
from the linear function.
00:34:29.280 --> 00:34:33.750
And this recursion
tree is, in my mind,
00:34:33.750 --> 00:34:38.650
the simplest way of
arguing the theta n log n
00:34:38.650 --> 00:34:41.100
complexity of merge sort.
00:34:41.100 --> 00:34:44.339
So what I want to do is
expand this recurrence out.
00:34:44.339 --> 00:34:45.505
And let's do that over here.
00:35:06.830 --> 00:35:10.950
So I have cn on top.
00:35:10.950 --> 00:35:15.850
I'm going to ignore this
constant factor because cn
00:35:15.850 --> 00:35:16.550
dominates.
00:35:16.550 --> 00:35:18.080
So I'll just start with cn.
00:35:18.080 --> 00:35:23.450
I want to break things
up, as I do the recursion.
00:35:23.450 --> 00:35:26.960
So when I go cn, at
the top level-- that's
00:35:26.960 --> 00:35:29.750
the work I have to do at
the merge, at the top level.
00:35:29.750 --> 00:35:33.110
And then when I go down to two
smaller problems, each of them
00:35:33.110 --> 00:35:34.480
is size n over 2.
00:35:34.480 --> 00:35:38.440
So I do c times n
divided by 2 [INAUDIBLE].
00:35:38.440 --> 00:35:41.617
So this is just a constant c.
00:35:41.617 --> 00:35:43.200
I didn't want to
write thetas up here.
00:35:43.200 --> 00:35:44.440
You could.
00:35:44.440 --> 00:35:46.760
And I'll say a little bit
more about that later.
00:35:46.760 --> 00:35:49.180
But think of this cn as
representing the theta n
00:35:49.180 --> 00:35:50.260
complexity.
00:35:50.260 --> 00:35:52.790
And c is this constant.
00:35:52.790 --> 00:35:57.960
So c times n, here. c
times n over 2, here.
00:35:57.960 --> 00:36:01.760
And then when I keep going,
I have c times n over 4,
00:36:01.760 --> 00:36:08.910
c times n over 4, et cetera,
and so on, and so forth.
00:36:08.910 --> 00:36:10.650
And when I come down
all the way here,
00:36:10.650 --> 00:36:16.670
n is eventually going to become
1-- or essentially a constant--
00:36:16.670 --> 00:36:20.790
and I'm going to have
a bunch of c's here.
00:36:20.790 --> 00:36:27.050
So here's another question,
that I'd like you to answer.
00:36:27.050 --> 00:36:31.210
Someone tell me what the number
of levels in this tree are,
00:36:31.210 --> 00:36:34.060
precisely, and the number
of leaves in this tree are,
00:36:34.060 --> 00:36:35.570
precisely.
00:36:35.570 --> 00:36:38.061
AUDIENCE: The number of
levels is log n plus 1.
00:36:38.061 --> 00:36:39.060
PROFESSOR: Log n plus 1.
00:36:39.060 --> 00:36:41.169
Log to the base 2 plus 1.
00:36:41.169 --> 00:36:42.210
And the number of leaves?
00:36:48.430 --> 00:36:50.580
You raised your hand
back there, first.
00:36:50.580 --> 00:36:51.430
Number of leaves.
00:36:51.430 --> 00:36:52.880
AUDIENCE: I think n.
00:36:52.880 --> 00:36:54.130
PROFESSOR: Yeah, you're right.
00:36:54.130 --> 00:36:56.210
You think right.
00:36:56.210 --> 00:37:02.520
So 1 plus log n and n leaves.
00:37:02.520 --> 00:37:05.870
When n becomes 1, how
many of them do you have?
00:37:05.870 --> 00:37:09.470
You're down to a single element,
which is, by definition,
00:37:09.470 --> 00:37:10.580
sorted.
00:37:10.580 --> 00:37:13.730
And you have n leaves.
00:37:13.730 --> 00:37:17.020
So now let's add up the work.
00:37:17.020 --> 00:37:20.230
I really like this
picture because it's just
00:37:20.230 --> 00:37:23.450
so intuitive in terms
of getting us the result
00:37:23.450 --> 00:37:25.090
that we're looking for.
00:37:25.090 --> 00:37:30.080
So you add up the work in each
of the levels of this tree.
00:37:30.080 --> 00:37:32.190
So the top level is cn.
00:37:32.190 --> 00:37:39.790
The second level is cn because
I added 1/2 and 1/2, cn, cn.
00:37:39.790 --> 00:37:40.750
Wow.
00:37:40.750 --> 00:37:43.010
What symmetry.
00:37:43.010 --> 00:37:50.500
So you're doing the same
amount of work modulo
00:37:50.500 --> 00:37:54.050
the constant factors,
here, with what's
00:37:54.050 --> 00:37:56.280
going on with the c1,
which we've ignored,
00:37:56.280 --> 00:37:59.870
but roughly the same amount
of work in each of the levels.
00:37:59.870 --> 00:38:02.570
And now, you know how
many levels there are.
00:38:02.570 --> 00:38:04.850
It's 1 plus log n.
00:38:04.850 --> 00:38:11.930
So if you want to write
an equation for T of n,
00:38:11.930 --> 00:38:23.030
it's 1 plus log n times c times
n, which is theta of n log n.
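NOTE A small numeric check of the level-by-level argument, under two assumptions not stated in the lecture: n is a power of 2, and a single-element problem costs c. The function name `merge_sort_work` is hypothetical.

```python
def merge_sort_work(n, c=1):
    """Evaluate T(n) = 2 T(n/2) + c*n with T(1) = c, for n a power of 2."""
    if n == 1:
        return c
    return 2 * merge_sort_work(n // 2, c) + c * n


for n in (1, 2, 8, 1024):
    # For a power of 2, n.bit_length() equals 1 + log2(n): the number of levels.
    levels = n.bit_length()
    closed_form = levels * n  # (1 + log n) * c * n with c = 1
    print(n, merge_sort_work(n), closed_form)
```

NOTE For every power of 2 the recursive value and the closed form agree, since each of the 1 + log n levels contributes the same c times n.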
00:38:26.520 --> 00:38:31.049
So I've mixed in
constants c and thetas.
00:38:31.049 --> 00:38:32.590
For the purposes of
this description,
00:38:32.590 --> 00:38:33.950
they're interchangeable.
00:38:33.950 --> 00:38:38.095
You will see recurrences that
look like this, in class.
00:38:45.210 --> 00:38:46.860
And things like that.
00:38:46.860 --> 00:38:48.370
Don't get confused.
00:38:48.370 --> 00:38:51.150
It's just a constant
multiplicative factor
00:38:51.150 --> 00:38:54.510
in front of the
function that you have.
00:38:54.510 --> 00:38:56.230
And it's just a little
easier, I think,
00:38:56.230 --> 00:38:58.140
to write down these
constant factors
00:38:58.140 --> 00:39:00.510
and realize that the
amount of work done
00:39:00.510 --> 00:39:02.980
is the same in
each of the leaves.
00:39:02.980 --> 00:39:06.010
And once you know the
dimensions of this tree,
00:39:06.010 --> 00:39:08.930
in terms of levels and in
terms of the number of leaves,
00:39:08.930 --> 00:39:10.960
you get your result.
00:39:14.560 --> 00:39:17.425
So we've looked at
two algorithms, so far.
00:39:26.160 --> 00:39:29.540
And insertion sort, if
you talk about numbers,
00:39:29.540 --> 00:39:31.964
is theta n squared for swaps.
00:39:31.964 --> 00:39:33.130
Merge sort is theta n log n.
00:39:36.270 --> 00:39:38.680
Here's another
interesting question.
00:39:38.680 --> 00:39:44.720
What is one advantage of
insertion sort over merge sort?
00:39:50.176 --> 00:39:51.180
AUDIENCE: [INAUDIBLE]
00:39:51.180 --> 00:39:52.732
PROFESSOR: What does that mean?
00:39:52.732 --> 00:39:54.773
AUDIENCE: You don't have
to move elements outside
00:39:54.773 --> 00:39:56.960
of [INAUDIBLE].
00:39:56.960 --> 00:39:58.420
PROFESSOR: That's exactly right.
00:39:58.420 --> 00:40:01.330
That's exactly right.
00:40:01.330 --> 00:40:03.270
So the two guys who
answered the questions
00:40:03.270 --> 00:40:05.840
before with the levels, and you.
00:40:05.840 --> 00:40:07.740
Come to me after class.
00:40:07.740 --> 00:40:09.690
So that's a great answer.
00:40:09.690 --> 00:40:12.180
In-place
sorting is something
00:40:12.180 --> 00:40:14.820
that has to do with
auxiliary space.
00:40:14.820 --> 00:40:19.280
And so what you see, here--
and it was a bit hidden, here.
00:40:19.280 --> 00:40:21.940
But the fact of the
matter is that you
00:40:21.940 --> 00:40:25.530
had L prime and R prime.
00:40:25.530 --> 00:40:29.910
And L prime and R prime are
different from L and R, which
00:40:29.910 --> 00:40:33.440
were the initial halves of
the inputs to the sorting
00:40:33.440 --> 00:40:34.990
algorithm.
00:40:34.990 --> 00:40:38.630
And what I said here is, we're
going to dump this into A.
00:40:38.630 --> 00:40:40.440
That's what this picture shows.
00:40:40.440 --> 00:40:43.340
This says sorted
array, A. And so you
00:40:43.340 --> 00:40:48.720
had to make a copy of the
array-- the two halves L
00:40:48.720 --> 00:40:52.270
and R-- in order to
do the recursion,
00:40:52.270 --> 00:40:54.490
and then to take the
results and put them
00:40:54.490 --> 00:40:56.790
into the sorted array, A.
00:40:56.790 --> 00:40:59.220
So you needed-- in
merge sort-- you
00:40:59.220 --> 00:41:04.060
needed theta n auxiliary space.
00:41:04.060 --> 00:41:10.370
So merge sort, you need
theta n extra space.
00:41:10.370 --> 00:41:17.380
And the definition
of in-place sorting
00:41:17.380 --> 00:41:21.575
implies that you have theta
1-- constant-- auxiliary space.
00:41:24.580 --> 00:41:27.330
The auxiliary space
for insertion sort
00:41:27.330 --> 00:41:30.450
is simply that
temporary variable
00:41:30.450 --> 00:41:33.310
that you need when
you swap two elements.
00:41:33.310 --> 00:41:35.520
So when you want to swap
a couple of registers,
00:41:35.520 --> 00:41:38.070
you gotta store one of the
values in a temporary location,
00:41:38.070 --> 00:41:39.600
overwrite the other, et cetera.
00:41:39.600 --> 00:41:43.190
And that's the theta 1 auxiliary
space for insertion sort.
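NOTE A sketch of the swap-based insertion sort, written to make the single temporary variable visible. The names `insertion_sort` and `tmp` are illustrative and not taken from the lecture's code.

```python
def insertion_sort(a):
    """Sort the list a in place using pairwise swaps."""
    for i in range(1, len(a)):
        j = i
        # Swap the new element leftward until it is in position.
        while j > 0 and a[j - 1] > a[j]:
            tmp = a[j]          # the only extra storage: theta(1)
            a[j] = a[j - 1]
            a[j - 1] = tmp
            j -= 1


data = [20, 13, 7, 2, 12]
insertion_sort(data)
print(data)
```

NOTE No second list is ever allocated. The input list itself ends up sorted, which is what in-place means here.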
00:41:43.190 --> 00:41:47.330
So there is an advantage of
the version of insertion sort
00:41:47.330 --> 00:41:49.140
we've talked about,
today, over merge sort.
00:41:49.140 --> 00:41:52.827
And if you have a billion
elements, that's potentially
00:41:52.827 --> 00:41:54.660
something you don't
want to store in memory.
00:41:54.660 --> 00:41:57.550
If you want to do something
really fast and do everything
00:41:57.550 --> 00:42:00.400
in cache or main
memory, and you want
00:42:00.400 --> 00:42:03.610
to sort billions or maybe
even trillions of items,
00:42:03.610 --> 00:42:07.740
this becomes an
important consideration.
00:42:07.740 --> 00:42:12.930
I will say that you can
reduce the constant factor
00:42:12.930 --> 00:42:14.530
of the theta n.
00:42:14.530 --> 00:42:16.590
So in the vanilla
scheme, you could
00:42:16.590 --> 00:42:18.690
imagine that you have to
have a copy of the array.
00:42:18.690 --> 00:42:20.900
So if you had n
elements, you essentially
00:42:20.900 --> 00:42:24.490
have n extra items of storage.
00:42:24.490 --> 00:42:28.130
You can make that n over 2
with a simple coding trick
00:42:28.130 --> 00:42:32.710
by keeping 1/2 of A.
00:42:32.710 --> 00:42:35.800
You can throw away one of
the L's or one of the R's.
00:42:35.800 --> 00:42:37.637
And you can get it
down to n over 2.
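NOTE One possible reading of this trick, sketched in Python. The lecture does not show code for it, so the approach below (copy only the left half, then merge back into the original array) and the names `merge_with_half_buffer` and `merge_sort_half_buffer` are assumptions.

```python
def merge_with_half_buffer(a, lo, mid, hi):
    """Merge sorted a[lo:mid] and a[mid:hi] in a, copying only the left half."""
    left = a[lo:mid]       # auxiliary storage: at most n/2 items
    i, j, k = 0, mid, lo   # i: finger in the copy, j: finger in a, k: write position
    while i < len(left) and j < hi:
        if left[i] <= a[j]:
            a[k] = left[i]
            i += 1
        else:
            a[k] = a[j]    # safe: k never passes j
            j += 1
        k += 1
    # Leftover copied items go at the end; leftover right items are already in place.
    while i < len(left):
        a[k] = left[i]
        i += 1
        k += 1


def merge_sort_half_buffer(a, lo=0, hi=None):
    """Sort a[lo:hi] in place with at most n/2 extra items at any one time."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    merge_sort_half_buffer(a, lo, mid)
    merge_sort_half_buffer(a, mid, hi)
    merge_with_half_buffer(a, lo, mid, hi)


data = [5, 2, 9, 1, 5, 6, 0]
merge_sort_half_buffer(data)
print(data)
```

NOTE The extra storage is still theta n. Only the constant factor in front of n changes.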
00:42:37.637 --> 00:42:39.470
And that turns out--
it's a reasonable thing
00:42:39.470 --> 00:42:41.410
to do if you have
a billion elements
00:42:41.410 --> 00:42:45.400
and you want to reduce your
storage by a constant factor.
00:42:45.400 --> 00:42:47.130
So that's one coding trick.
00:42:47.130 --> 00:42:49.630
Now it turns out that you
can actually go further.
00:42:49.630 --> 00:42:52.130
And there's a fairly
sophisticated algorithm
00:42:52.130 --> 00:42:54.740
that's sort of beyond
the scope of 6.006
00:42:54.740 --> 00:42:56.420
that's an in-place merge sort.
00:42:59.310 --> 00:43:03.070
And this in-place
merge sort is kind of
00:43:03.070 --> 00:43:08.590
impractical in the sense
that it doesn't do very well
00:43:08.590 --> 00:43:10.140
in terms of the
constant factors.
00:43:10.140 --> 00:43:15.120
So while it's in-place and
it's still theta n log n.
00:43:15.120 --> 00:43:19.720
The problem is that the running
time of an in-place merge sort
00:43:19.720 --> 00:43:23.210
is much worse than the
regular merge sort that
00:43:23.210 --> 00:43:25.510
uses theta n auxiliary space.
00:43:25.510 --> 00:43:28.100
So people don't really
use in-place merge sort.
00:43:28.100 --> 00:43:29.360
It's a great paper.
00:43:29.360 --> 00:43:31.800
It's a great thing to read.
00:43:31.800 --> 00:43:37.080
Its analysis is a bit
sophisticated for double 0 6.
00:43:37.080 --> 00:43:39.030
So we won't go there.
00:43:39.030 --> 00:43:40.330
But it does exist.
00:43:40.330 --> 00:43:42.003
So you can take merge
sort, and I just
00:43:42.003 --> 00:43:45.230
want to let you know that
you can do things in-place.
00:43:45.230 --> 00:43:50.560
In terms of numbers, some
experiments we ran a few years
00:43:50.560 --> 00:43:54.650
ago-- so these may not
be completely valid
00:43:54.650 --> 00:43:56.650
because I'm going to
actually give you numbers--
00:43:56.650 --> 00:44:07.380
but merge sort in Python, if
you write a little curve fit
00:44:07.380 --> 00:44:17.790
program to do this, is 2.2n log
n microseconds for a given n.
00:44:17.790 --> 00:44:19.625
So this is the
merge sort routine.
00:44:22.450 --> 00:44:32.230
And if you look at
insertion sort, in Python,
00:44:32.230 --> 00:44:39.410
that's something like 0.2
n square microseconds.
00:44:39.410 --> 00:44:42.700
So you see the
constant factors here.
00:44:42.700 --> 00:44:48.230
If you do insertion sort in C,
which is a compiled language,
00:44:48.230 --> 00:44:50.420
then, it's much faster.
00:44:50.420 --> 00:44:52.935
It's about 20 times faster.
00:44:55.440 --> 00:44:59.230
It's 0.01 n squared
microseconds.
00:44:59.230 --> 00:45:00.960
So a little bit of
practice on the side.
00:45:00.960 --> 00:45:02.714
We do ask you to write code.
00:45:02.714 --> 00:45:03.630
And this is important.
00:45:03.630 --> 00:45:04.930
The reason we're
interested in algorithms
00:45:04.930 --> 00:45:06.770
is because people
want to run them.
00:45:06.770 --> 00:45:13.860
And what you can see is that
you can actually find an n-- so
00:45:13.860 --> 00:45:16.300
regardless of whether
you're using Python or C,
00:45:16.300 --> 00:45:20.020
this tells you that asymptotic
complexity is pretty important
00:45:20.020 --> 00:45:24.140
because, once n gets
beyond about 4,000,
00:45:24.140 --> 00:45:27.260
you're going to see that
merge sort in Python
00:45:27.260 --> 00:45:30.350
beats insertion sort in C.
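NOTE A quick way to explore the crossover using the two fitted formulas quoted above. The base of the logarithm is not stated in the lecture, so base 2 is an assumption here, and the function names are hypothetical.

```python
import math


def merge_sort_python_us(n):
    """Fitted model from the lecture: 2.2 n log n microseconds (log base 2 assumed)."""
    return 2.2 * n * math.log2(n)


def insertion_sort_c_us(n):
    """Fitted model from the lecture: 0.01 n^2 microseconds."""
    return 0.01 * n * n


def crossover(start=2, limit=1_000_000):
    """Smallest n >= start at which the merge sort model is the faster one."""
    for n in range(start, limit):
        if merge_sort_python_us(n) < insertion_sort_c_us(n):
            return n
    return None


print(crossover())
print(merge_sort_python_us(4000), insertion_sort_c_us(4000))
```

NOTE Under these assumptions the models cross at an n of a few thousand, and by n = 4,000 the Python merge sort model is clearly ahead. The exact crossover depends on the log base and on how well the curve fits hold.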
00:45:30.350 --> 00:45:35.430
So the constant
factors get subsumed
00:45:35.430 --> 00:45:37.160
beyond certain values of n.
00:45:37.160 --> 00:45:39.835
So that's why asymptotic
complexity is important.
00:45:39.835 --> 00:45:41.210
You do have a
factor of 20, here,
00:45:41.210 --> 00:45:43.270
but that doesn't really
help you in terms
00:45:43.270 --> 00:45:47.440
of keeping an n square
algorithm competitive.
00:45:47.440 --> 00:45:49.400
It stays competitive
for a little bit longer,
00:45:49.400 --> 00:45:50.510
but then falls behind.
00:45:54.520 --> 00:45:57.387
That's what I wanted
to cover for sorting.
00:45:57.387 --> 00:45:58.970
So hopefully, you
have a sense of what
00:45:58.970 --> 00:46:02.040
happens with these two
sorting algorithms.
00:46:02.040 --> 00:46:05.200
We'll look at a very different
sorting algorithm next time,
00:46:05.200 --> 00:46:08.460
using heaps, which is a
different data structure.
00:46:08.460 --> 00:46:11.330
The last thing I want to do in
the couple minutes I have left
00:46:11.330 --> 00:46:14.810
is give you a little more
intuition as to recurrence
00:46:14.810 --> 00:46:18.680
solving based on this diagram
that I wrote up there.
00:46:18.680 --> 00:46:21.460
And so we're going to use
exactly this structure.
00:46:21.460 --> 00:46:24.250
And we're going to look at a
couple of different recurrences
00:46:24.250 --> 00:46:26.360
that I won't really
motivate in terms
00:46:26.360 --> 00:46:29.420
of having a specific
algorithm, but I'll just
00:46:29.420 --> 00:46:31.150
write out the recurrence.
00:46:31.150 --> 00:46:36.340
And we'll look at the
recursion tree for that.
00:46:36.340 --> 00:46:41.900
And I'll try and tease out of
you the complexity associated
00:46:41.900 --> 00:46:45.635
with these recurrences--
the overall complexity.
00:46:49.480 --> 00:46:58.000
So let's take a look at T
of n equals 2 T of n over 2
00:46:58.000 --> 00:47:00.310
plus c n squared.
00:47:02.820 --> 00:47:08.360
Let me just call that c--
no need for the brackets.
00:47:08.360 --> 00:47:10.970
So constant c times n squared.
00:47:10.970 --> 00:47:13.200
So if you had a
crummy merge routine,
00:47:13.200 --> 00:47:18.020
and it was taking n square,
and you coded it up wrong.
00:47:18.020 --> 00:47:20.050
It's not a great motivation
for this recurrence,
00:47:20.050 --> 00:47:23.980
but it's a way this
recurrence could have come up.
00:47:23.980 --> 00:47:27.470
So what does this
recursion tree look like?
00:47:27.470 --> 00:47:29.580
Well it looks kind of
the same, obviously.
00:47:29.580 --> 00:47:33.210
You have c n square; you
have c n square divided by 4;
00:47:33.210 --> 00:47:36.620
c n square divided by
4; c n square divided
00:47:36.620 --> 00:47:40.620
by 16, four times.
00:47:40.620 --> 00:47:44.460
Looking a little bit
different from the other one.
00:47:44.460 --> 00:47:47.560
The levels and the leaves
are exactly the same.
00:47:47.560 --> 00:47:49.720
Eventually n is going
to go down to 1.
00:47:49.720 --> 00:47:53.280
So you will see c
all the way here.
00:47:53.280 --> 00:47:54.735
And you're going
to have n leaves.
00:47:57.880 --> 00:48:03.380
And you will have, as
before, 1 plus log n levels.
00:48:03.380 --> 00:48:05.070
Everything is the same.
00:48:05.070 --> 00:48:07.590
And this is why I like this
recursion tree formulation so
00:48:07.590 --> 00:48:09.370
much because, now,
all I have to do
00:48:09.370 --> 00:48:14.710
is add up the work associated
with each of the levels
00:48:14.710 --> 00:48:17.100
to get the solution
to the recurrence.
00:48:17.100 --> 00:48:18.770
Now, take a look at
what happens, here.
00:48:18.770 --> 00:48:25.350
c n square; c n square divided
by 2; c n square divided by 4.
00:48:25.350 --> 00:48:27.890
And this is n times c.
00:48:30.890 --> 00:48:34.316
So what does that add up to?
00:48:34.316 --> 00:48:35.839
AUDIENCE: [INAUDIBLE]
00:48:35.839 --> 00:48:36.880
PROFESSOR: Yeah, exactly.
00:48:36.880 --> 00:48:37.920
Exactly right.
00:48:37.920 --> 00:48:40.430
So if you look at what
happens, here, this dominates.
00:48:44.340 --> 00:48:47.520
All of the other things are
actually less than that.
00:48:47.520 --> 00:48:49.250
And you said bounded
by two c n square
00:48:49.250 --> 00:48:51.420
because this part is
bounded by c n square
00:48:51.420 --> 00:48:54.490
and I already have c n
square up at the top.
00:48:54.490 --> 00:48:58.100
So this particular algorithm
that corresponds to this crummy
00:48:58.100 --> 00:49:02.300
merge sort, or wherever
this recurrence came from,
00:49:02.300 --> 00:49:06.700
is a theta n squared algorithm.
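NOTE The level sums can be written as a geometric series. This sketch assumes n is a power of 2 and that level i of the tree has 2^i nodes, each of size n / 2^i.

```latex
T(n) \;=\; \sum_{i=0}^{\log_2 n} 2^i \cdot c\left(\frac{n}{2^i}\right)^{2}
      \;=\; c\,n^2 \sum_{i=0}^{\log_2 n} \frac{1}{2^i}
      \;=\; c\,n^2\left(2 - \frac{1}{n}\right)
      \;\le\; 2\,c\,n^2
```

NOTE The first term of the sum is the root's c n squared and the last term is the n times c at the leaves, so the total lies between c n squared and 2 c n squared, which gives theta n squared.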
00:49:06.700 --> 00:49:10.520
And in this case,
all of the work done
00:49:10.520 --> 00:49:15.360
is at the root-- at the
top level of the recursion.
00:49:15.360 --> 00:49:17.650
Here, there was a
roughly equal amount
00:49:17.650 --> 00:49:21.630
of work done in each of
the different levels.
00:49:21.630 --> 00:49:26.610
Here, all of the work
was done at the root.
00:49:26.610 --> 00:49:29.460
And so to close
up shop, here, let
00:49:29.460 --> 00:49:34.210
me just give you real
quick a recurrence where
00:49:34.210 --> 00:49:40.470
all of the work is done at
the leaves, just for closure.
00:49:40.470 --> 00:49:45.770
So if I had, magically, a merge
routine that actually happened
00:49:45.770 --> 00:49:48.710
in constant time, either
through buggy analysis,
00:49:48.710 --> 00:49:51.890
or because it
was buggy, then what
00:49:51.890 --> 00:49:55.650
does the tree look
like for that?
00:49:55.650 --> 00:49:58.280
And I can think of
this as being theta 1.
00:49:58.280 --> 00:50:01.156
Or I can think of this as
being just a constant c.
00:50:01.156 --> 00:50:02.030
I'll stick with that.
00:50:02.030 --> 00:50:05.246
So I have c, c, c.
00:50:09.890 --> 00:50:11.350
Whoa, I tried to move that up.
00:50:11.350 --> 00:50:13.750
That doesn't work.
00:50:13.750 --> 00:50:15.545
So I have n leaves, as before.
00:50:18.314 --> 00:50:19.980
And so if I look at
what I have, here, I
00:50:19.980 --> 00:50:21.840
have c at the top level.
00:50:21.840 --> 00:50:25.870
I have 2c, and so
on and so forth.
00:50:25.870 --> 00:50:26.930
4c.
00:50:26.930 --> 00:50:30.940
And then I go all
the way down to nc.
00:50:30.940 --> 00:50:33.380
And so what happens
here is this dominates.
00:50:36.010 --> 00:50:41.600
And so, in this recurrence, the
whole thing runs in theta n.
00:50:41.600 --> 00:50:46.300
So the solution to
that is theta n.
00:50:46.300 --> 00:50:50.970
And what you have here
is all of the work
00:50:50.970 --> 00:50:54.450
being done at the leaves.
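NOTE A sketch that tabulates the work per level of the recursion tree for the three recurrences discussed, assuming n is a power of 2 and c = 1. The function name `level_work` and its parameter `f` (the non-recursive cost at a node of a given size) are hypothetical.

```python
def level_work(n, f):
    """Per-level work for T(n) = 2 T(n/2) + f(n), listed root first, n a power of 2."""
    levels = []
    nodes, size = 1, n
    while size >= 1:
        levels.append(nodes * f(size))  # nodes at this level times cost per node
        nodes *= 2
        size //= 2
    return levels


n = 8
for name, f in [("c*n", lambda m: m),
                ("c*n^2", lambda m: m * m),
                ("c", lambda m: 1)]:
    work = level_work(n, f)
    print(name, work, sum(work))
```

NOTE For n = 8 the three rows come out as [8, 8, 8, 8], [64, 32, 16, 8], and [1, 2, 4, 8]: equal work per level, work dominated by the root, and work dominated by the leaves.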
00:50:54.450 --> 00:50:58.440
We're not going to really cover
this theorem that gives you
00:50:58.440 --> 00:51:02.340
a mechanical way of figuring
this out because we think
00:51:02.340 --> 00:51:05.780
the recursion tree is a
better way of looking at it.
00:51:05.780 --> 00:51:08.920
But you can see that, depending
on what that function is,
00:51:08.920 --> 00:51:12.130
in terms of the work being
done in the merge routine,
00:51:12.130 --> 00:51:14.490
you'd have different
versions of recurrences.
00:51:14.490 --> 00:51:16.990
I'll stick around, and people
who answered questions, please
00:51:16.990 --> 00:51:18.270
pick up your cushions.
00:51:18.270 --> 00:51:20.240
See you next time.