Topics covered: Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort
Instructors: Prof. Erik Demaine, Prof. Charles Leiserson
Transcript - Lecture 5
Today we're going to talk about sorting, which may not come as such a big surprise. We talked about sorting for a while, but we're going to talk about it at a somewhat higher level and question some of the assumptions that we've been making so far. And we're going to ask the question how fast can we sort? A pretty natural question. You may think you know the answer. Perhaps you do. Any suggestions on what the answer to this question might be? There are several possible answers. Many of them are partially correct. Let's hear any kinds of answers you'd like and start waking up this fresh morning. Sorry?
Theta n log n. That's a good answer. That's often correct. Any other suggestions? N squared. That's correct if all you're allowed to do is swap adjacent elements. Good. That was close. I will see if I can make every answer correct. Usually n squared is not the right answer, but in some models it is. Yeah? Theta n is also sometimes the right answer. The real answer is "it depends". That's the point of today's lecture. It depends on what we call the computational model, what you're allowed to do. And, in particular here, with sorting, what we care about is the order of the elements, how are you allowed to manipulate the elements, what are you allowed to do with them and find out their order. The model is what you can do with the elements.
Now, we've seen several sorting algorithms. Do you want to shout some out? I think we've seen four, but maybe you know even more algorithms. Quicksort. Keep going. Heapsort. Merge sort. You can remember all the way back to Lecture 1. Any others? Insertion sort. All right. You're on top of it today. I don't know exactly why, but these two are single words and these two are two words. That's the style. What is the running time of quicksort?
This is a bit tricky. N log n in the average case. Or, if we randomize quicksort, randomized quicksort runs in n log n expected for any input sequence. Let's say n lg n randomized. That's theta. And the worst-case with plain old quicksort where you just pick the first element as the partition element. That's n^2. Heapsort, what's the running time there? n lg n always. Merge sort, I hope you can remember that as well, n lg n. And insertion sort? n^2. All of these algorithms run no faster than n lg n, so we might ask, can we do better than n lg n?
And that is a question, in some sense, we will answer both yes and no to today. But all of these algorithms have something in common in terms of the model of what you're allowed to do with the elements. Any guesses on what that model might be? Yeah? You compare pairs of elements, exactly. That is indeed the model used by all four of these algorithms. And in that model n lg n is the best you can do. We have so far just looked at what are called comparison sorting algorithms or "comparison sorts". And this is a model for the sorting problem of what you're allowed to do.
Here all you can do is use comparisons meaning less than, greater than, less than or equal to, greater than or equal to, equals to determine the relative order of elements. This is a restriction on algorithms. It is, in some sense, stating what kinds of elements we're dealing with. They are elements that we can somehow compare. They have a total order, some are less, some are bigger. But is also restricts the algorithm. You could say, well, I'm sorting integers, but still I'm only allowed to do comparisons with them. I'm not allowed to multiply the integers or do other weird things. That's the comparison sorting model. And this lecture, in some sense, follows the standard mathematical progression where you have a theorem, then you have a proof, then you have a counter example.
It's always a good way to have a math lecture. We're going to prove the theorem that no comparison sorting algorithm runs better than n lg n. Comparisons. State the theorem, prove that, and then we'll give a counter example in the sense that if you go outside the comparison sorting model you can do better, you can get linear time in some cases, better than n lg n. So, that is what we're doing today.
But first we're going to stick to this comparison model and try to understand why we need n lg n comparisons if that's all we're allowed to do. And for that we're going to look at something called decision trees, which in some sense is another model of what you're allowed to do in an algorithm, but it's more general than the comparison model. And let's try and example to get some intuition. Suppose we want to sort three elements. This is not very challenging, but we'll get to draw the decision tree that corresponds to sorting three elements. Here is one solution I claim.
This is, in a certain sense, an algorithm, but it's drawn as a tree instead of pseudocode. What this tree means is that each node you're making a comparison. This says compare a_1 versus a_2. If a_1 is smaller than a_2 you go this way, if it is bigger than a_2 you go this way, and then you proceed. When you get down to a leaf, this is the answer. Remember, the sorting problem is you're trying to find a permutation of the inputs that puts it in sorted order. Let's try it with some sequence of numbers, say 9, 4 and 6. We want to sort 9, 4 and 6, so first we compare the first element with the second element. 9 is bigger than 4 so we go down this way.
Then we compare the first element with the third element, that's 9 versus 6. 9 is bigger than 6, so we go this way. And then we compare the second element with the third element, 4 is less than 6 and, so we go this way. And the claim is that this is the correct permutation of the elements. You take a_2, which is 4, then you take a_3, which is 6, and then you take a_1, which is 9, so indeed that works out. And if I wrote this down right, this is a sorting algorithm in the decision tree model.
In general, let me just say the rules of this game. In general, we have n elements we want to sort. And I only drew the n = 3 case because these trees get very big very quickly. Each internal node, so every non-leaf node, has a label of the form i : j where i and j are between 1 and n. And this means that we compare a_i with a_j. And we have two subtrees from every such node. We have the left subtree which tells you what the algorithm does, what subsequent comparisons it makes if it comes out less than.
And we have to be a little bit careful because it could also come out equal. What we will do is the left subtree corresponds to less than or equal to and the right subtree corresponds to strictly greater than. That is a little bit more precise than what we were doing here. Here all the elements were distinct so no problem. But, in general, we care about the equality case too to be general.
So, that was the internal nodes. And then each leaf node gives you a permutation. So, in order to be the answer to that sorting problem, that permutation better have the property that it orders the elements. This is from the first lecture when we defined the sorting problem. Some permutation on n things such that a_pi(1) is less than or equal to a_pi(2) and so on. So, that is the definition of a decision tree. Any binary tree with these kinds of labels satisfies all these properties. That is, in some sense, a sorting algorithm. It's a sorting algorithm in the decision tree model.
Now, as you might expect, this is really not too different than the comparison model. If I give you a comparison sorting algorithm, we have these four, quicksort, heapsort, merge sort and insertion sort. All of them can be translated into the decision tree model. It's sort of a graphical representation of what the algorithm does. It's not a terribly useful one for writing down an algorithm. Any guesses why?
Why do we not draw these pictures as a definition of quicksort or a definition of merge sort? It depends on the size of the input, that's a good point. This tree is specific to the value of n, so it is, in some sense, not as generic. Now, we could try to write down a construction for an arbitrary value of n of one of these decision trees and that would give us sort of a real algorithm that works for any input size.
But even then this is not a terribly convenient representation for writing down an algorithm. Well, let's write down a transformation that converts a comparison sorting algorithm to a decision tree and then maybe you will see why. This is not a useless model, obviously, I wouldn't be telling you otherwise. It will be very powerful for proving that we cannot do better than n lg n, but as writing down an algorithm, if you were going to implement something, this tree is not so useful.
Even if you had a decision tree computer, whatever that is. But let's prove this theorem that decision trees, in some sense, model comparison sorting algorithms, which we call just comparison sorts. This is a transformation. And we're going to build one tree for each value of n. The decision trees depend on n. The algorithm hopefully, well, it depends on n, but it works for all values of n. And we're just going to think of the algorithm as splitting into two forks, the left subtree and the right subtree whenever it makes a comparison.
If we take a comparison sort like merge sort. And it does lots of stuff. It does index arithmetic, it does recursion, whatever. But at some point it makes a comparison and then we say, OK, there are two halves of the algorithm. There is what the algorithm would do if the comparison came out less than or equal to and what the algorithm would do if the comparison came out greater than. So, you can build a tree in this way. In some sense, what this tree is doing is listing all possible executions of this algorithm considering what would happen for all possible values of those comparisons.
We will call these all possible instruction traces. If you write down all the instructions that are executed by this algorithm, for all possible input arrays, a_1 to a_n, see what all the comparisons, how they could come and what the algorithm does, in the end you will get a tree. Now, how big will that tree be roughly? As a function of n. Yeah? Right. If it's got to be able to sort every possible list of length n, at the leaves I have to have all the permutations of those elements.
That is a lot. There are a lot of permeations on n elements. There's n factorial of them. N factorial is exponential, it's really big. So, this tree is huge. It's going to be exponential on the input size n. That is why we don't write algorithms down normally as a decision tree, even though in some cases maybe we could. It's not a very compact representation. These algorithms, you write them down in pseudocode, they have constant length. It's a very succinct representation of this algorithm. Here the length depends on n and it depends exponentially on n, which is not useful if you wanted to implement the algorithm because writing down the algorithm would take a long time. But, nonetheless, we can use this as a tool to analyze these comparison sorting algorithms. We have all of these. Any algorithm can be transformed in this way into a decision tree.
And now we have this observation that the number of leaves in this decision tree has to be really big. Let me talk about leaves in a second. Before we get to leaves, let's talk about the depth of the tree. This decision tree represents all possible executions of the algorithm. If I look at a particular execution, which corresponds to some root to leaf path in the tree, the running time or the number of comparisons made by that execution is just the length of the path.
And, therefore, the worst-case running time, over all possible inputs of length n, is going to be -- n - 1? Could be. Depends on the decision tree. But, as a function of the decision tree? The longest path, right, which is called the height of the tree. So, this is what we want to measure. We want to claim that the height of the tree has to be at least n lg n with an omega in front. That is what we'll prove.
And the only thing we're going to use is that the number of leaves in that tree has to be big, has to be n factorial. This is a lower bound on decision tree sorting. And the lower bound says that if you have any decision tree that sorts n elements then its height has to be at least n lg n up to constant factors. So, that is the theorem. Now we're going to prove the theorem. And we're going to use that the number of leaves in that tree must be at least n factorial.
Because there are n factorial permutations of the inputs. All of them could happen. And so, for this algorithm to be correct, it has detect every one of those permutations in some way. Now, it may do it very quickly. We better only need n lg n comparisons because we know that's possible. The depth of the tree may not be too big, but it has to have a huge number of leaves down there. It has to branch enough to get n factorial leaves because it has to give the right answer in possible inputs. This is, in some sense, counting the number of possible inputs that we have to distinguish. This is the number of leaves. What we care about is the height of the tree. Let's call the height of the tree h.
Now, if I have a tree of height h, how many leaves could it have? What's the maximum number of leaves it could have? 2^h, exactly. Because this is binary tree, comparison trees always have a branching factor of 2, the number of leaves has to be at most 2^h, if I have a height h tree. Now, this gives me a relation. The number of leaves has to be greater than or equal to n factorial and the number of leaves has to be less than or equal to 2^h. Therefore, n factorial is less than or equal to 2^h, if I got that right.
Now, again, we care about h in terms of n factorial, so we solve this by taking logs. And I am also going to flip sides. Now h is at least log base 2, because there is a 2 over here, of n factorial. There is a property that I'm using here in order to derive this inequality from this inequality. This is a technical aside, but it's important that you realize there is a technical issue here.
The general principle I'm applying is I have some inequality, I do the same thing to both sides, and hopefully that inequality should still be true. But, in order for that to be the case, I need a property about that operation that I'm performing. It has to be a monotonic transformation. Here what I'm using is that log is a monotonically increasing function. That is important. If I multiply both sides by -1, which is a decreasing function, the inequality would have to get flipped. The fact that the inequality is not flipping here, I need to know that log is monotonically increasing. If you see log that's true. We need to be careful here.
Now we need some approximation of n factorial in order to figure out what its log is. Does anyone know a good approximation for n factorial? Not necessarily the equation but the name. Stirling's formula. Good. You all remember Stirling. And I just need the highest order term, which I believe is that. N factorial is at least (n/e)^n. So, that's all we need here. Now I can use properties of logs to bring the n outside. This is n lg (n/e).
And then lg (n/e) I can simplify. That is just lg n - lg e. So, this is n(lg n - lg e). Lg e is a constant, so it's really tiny compared to this lg n which is growing within. This is Omega(n lg n). All we care about is the leading term. It is actually Theta(n lg n), but because we have it greater than or equal to all we care about is the omega. A theta here wouldn't give us anything stronger.
Of course, not all algorithms have n lg n running time or make n lg n comparisons. Some of them do, some of them are worse, but this proves that all of them require a height of at least n lg n. There you see proof, once you observe the fact about the number of leaves, and if you remember Stirling's formula. So, you should know this proof. You can show that all sorts of problems require n lg n time with this kind of technique, provided you're in some kind of a decision tree model.
That's important. We really need that our algorithm can be phrased as a decision tree. And, in particular, we know from this transformation that all comparison sorts can be represented as the decision tree. But there are some sorting algorithms which cannot be represented as a decision tree. And we will turn to that momentarily. But before we get there I phrased this theorem as a lower bound on decision tree sorting. But, of course, we also get a lower bound on comparison sorting.
And, in particular, it tells us that merge sort and heapsort are asymptotically optimal. Their dependence on n, in terms of asymptotic notation, so ignoring constant factors, these algorithms are optimal in terms of growth of n, but this is only in the comparison model. So, among comparison sorting algorithms, which these are, they are asymptotically optimal. They use the minimum number of comparisons up to constant factors. In fact, their whole running time is dominated by the number of comparisons. It's all Theta(n lg n). So, this is good news. And I should probably mention a little bit about what happens with randomized algorithms. What I've described here really only applies, in some sense, to deterministic algorithms.
Does anyone see what would change with randomized algorithms or where I've assumed that I've had a deterministic comparison sort? This is a bit subtle. And I only noticed it reading the notes this morning, oh, wait. I will give you a hint. It's over here, the right-hand side of the world. If I have a deterministic algorithm, what the algorithm does is completely determinate at each step.
As long as I know all the comparisons that it made up to some point, it's determinate what that algorithm will do. But, if I have a randomized algorithm, it also depends on the outcomes of some coin flips. Any suggestions of what breaks over here? There is more than one tree, exactly. So, we had this assumption that we only have one tree for each n. In fact, what we get is a probability distribution over trees. For each value of n, if you take all the possible executions of that algorithm, all the instruction traces, well, now, in addition to branching on comparisons, we also branch on whether a coin flip came out heads or tails, or however we're generating random numbers it came out with some value between 1 and n. So, we get a probability distribution over trees. This lower bound still applies, though.
Because, no matter what tree we get, I don't really care. I get at least one tree for each n. And this proof applies to every tree. So, no matter what tree you get, if it is a correct tree it has to have height Omega(n lg n). This lower bound applies even for randomized algorithms. You cannot get better than n lg n, because no matter what tree it comes up with, no matter how those coin flips come out, this argument still applies.
Every tree that comes out has to be correct, so this is really at least one tree. And that will now work. We also get the fact that randomized quicksort is asymptotically optimal in expectation. But, in order to say that randomized quicksort is asymptotically optimal, we need to know that all randomized algorithms require Omega(n lg n) comparisons. Now we know that so all is well. That is the comparison model. Any questions before we go on? Good.
The next topic is to burst outside of the comparison model and try to sort in linear time. It is pretty clear that, as long as you don't have some kind of a parallel algorithm or something really fancy, you cannot sort any better than linear time because you've at least got to look at the data. No matter what you're doing with the data, you've got to look at it, otherwise you're not sorting it correctly.
So, linear time is the best we could hope for. N lg n is pretty close. How could we sort in linear time? Well, we're going to need some more powerful assumption. And this is the counter example. We're going to have to move outside the comparison model and do something else with our elements. And what we're going to do is assume that they're integers in a particular range, and we will use that to sort in linear time. We're going to see two algorithms for sorting faster than n lg n.
The first one is pretty simple, and we will use it in the second algorithm. It's called counting sort. The input to counting sort is an array, as usual, but we're going to assume what those array elements look like. Each A[i] is an integer from the range of 1 to k. This is a pretty strong assumption. And the running time is actually going to depend on k. If k is small it is going to be a good algorithm. If k is big it's going to be a really bad algorithm, worse than n lg n. Our goal is to output some sorted version of this array. Let's call this sorting of A. It's going to be easier to write down the output directly instead of writing down permutation for this algorithm.
And then we have some auxiliary storage. I'm about to write down the pseudocode, which is why I'm declaring all my variables here. And the auxiliary storage will have length k, which is the range on my input values. Let's see the algorithm. This is counting sort. And it takes a little while to write down but it's pretty straightforward. First we do some initialization. Then we do some counting.
Then we do some summing. And then we actually write the output. Is that algorithm perfectly clear to everyone? No one. Good. This should illustrate how obscure pseudocode can be. And when you're solving your problem sets, you should keep in mind that it's really hard to understand an algorithm just given pseudocode like this. You need some kind of English description of what's going on because, while you could work through and figure out what this means, it could take half an hour to an hour. And that's not a good way of expressing yourself. And so what I will give you now is the English description, but we will refer back to this to understand.
This is sort of our bible of what the algorithm is supposed to do. Let me go over it briefly. The first step is just some initialization. The C[i]'s are going to count some things, count occurrences of values. And so first we set them to zero. Then, for every value we see A[j], we're going to increment the counter for that value A[j]. Then the C[i]s will give me the number of elements equal to a particular value i.
Then I'm going to take prefix sums, which will make it so that C[i] gives me the number of keys, the number of elements less than or equal to [i] instead of equals. And then, finally, it turns out that's enough to put all the elements in the right place. This I will call distribution. This is the distribution step. And it's probably the least obvious of all the steps. And let's do an example to make it more obvious what's going on.
Let's take an array A = [4, 1, 3, 4, 3]. And then I want some array C. And let me add some indices here so we can see what the algorithm is really doing. Here it turns out that all of my numbers are in the range 1 to 4, so k = 4. My array C has four values. Initially, I set them all to zero. That's easy. And now I want to count through everything. And let me not cheat here. I'm in the second step, so to speak. And I look for each element in order. I look at the C[i] value. The first element is 4, so I look at C4. That is 0. I increment it to 1. Then I look at element 1. That's 0. I increment it to 1.
Then I look at 3 and that's here. It is also 0. I increment it to 1. Not so exciting so far. Now I see 4, which I've seen before, how exciting. I had value 1 in here, I increment it to 2. Then I see value 3, which also had a value of 1. I increment that to 2. The result is [1, 0, 2, 2]. That's what my array C looks like at this point in the algorithm. Now I do a relatively simple transformation of taking prefix sums. I want to know, instead of these individual values, the sum of this prefix, the sum of this prefix, the sum of this prefix and the sum of this prefix. I will call that C prime just so we don't get too lost in all these different versions of C. This is just 1. And 1 plus 0 is 1. 1 plus 2 is 3. 3 plus 2 is 5. So, these are sort of the running totals.
There are five elements total, there are three elements less than or equal to 3, there is one element less than or equal to 2, and so on. Now, the fun part, the distribution. And this is where we get our array B. B better have the same size, every element better appear here somewhere and they should come out in sorted order. Let's just run the algorithm. j is going to start at the end of the array and work its way down to 1, the beginning of the array.
And what we do is we pick up the last element of A, A[n]. We look at the counter. We look at the C vector for that value. Here the value is 3, and this is the third column, so that has number 3. And the claim is that's where it belongs in B. You take this number 3, you put it in index 3 of the array B. And then you decrement the counter. I'm going to replace 3 here with 2. And the idea is these numbers tell you where those values should go. Anything of value 1 should go at position 1. Anything with value 3 should go at position 3 or less. This is going to be the last place that a 3 should go.
And then anything with value 4 should go at position 5 or less, definitely should go at the end of the array because 4 is the largest value. And this counter will work out perfectly because these counts have left enough space in each section of the array. Effectively, this part is reserved for ones, there are no twos, this part is reserved for threes, and this part is reserved for fours. You can check if that's really what this array means. Let's finish running the algorithm. That was the last element.
I won't cross it off, but we've sort of done that. Now I look at the next to last element. That's a 4. Fours go in position 5. So, I put my 4 here in position 5 and I decrement that counter. Next I look at another 3. Threes now go in position 2, so that goes there. And then I decrement that counter. I won't actually use that counter anymore, but let's decrement it because that's what the algorithm says. I look at the previous element. That's a 1.
Ones go in position 1, so I put it here and decrement that counter. And finally I have another 4. And fours go in position 4 now, position 4 is here, and I decrement that counter. So, that's counting sort. And you'll notice that all the elements appear and they appear in order, so that's the algorithm. Now, what's the running time of counting sort? kn is an upper bound. It's a little bit better than that. Actually, quite a bit better. This requires some summing. Let's go back to the top of the algorithm. How much time does this step take? k. How much time does this step take?
n. How much time does this step take? k. Each of these operations in the for loops is taking constant time, so it is how many iterations of that for loop are there? And, finally, this step takes n. So, the total running time of counting sort is k + n. And this is a great algorithm if k is relatively small, like at most n. If k is big like n^2 or 2^n or whatever, this is not such a good algorithm, but if k = O(n) this is great.
And we get our linear time sorting algorithm. Not only do we need the assumption that our numbers are integers, but we need that the range of the integers is pretty small for this algorithm to work. If all the numbers are between 1 and order n then we get a linear time algorithm. But as soon as they're up to n lg n we're toast. We're back to n lg n sorting. It's not so great. So, you could write a combination algorithm that says, well, if k is bigger than n lg n, then I will just use merge sort.
And if it's less than n lg n I'll use counting sort. And that would work, but we can do better than that. How's the time? It is worth noting that we've beaten our bound, but only assuming that we're outside the comparison model. We haven't really contradicted the original theorem, we're just changing the model. And it's always good to question what you're allowed to do in any problem scenario. In, say, some practical scenarios, this would be great if the numbers you're dealing with are, say, a byte long. Then k is only 2^8, which is 256. You need this auxiliary array of size 256, and this is really fast. 256 + n, no matter how big n is it's linear in n. If you know your numbers are small, it's great. But if you're numbers are bigger, say you still know they're integers but they fit in like 32 bit words, then life is not so easy.
Because k is then 2^32, which is 4.2 billion or so, which is pretty big. And you would need this auxiliary array of 4.2 billion words, which is probably like 16 gigabytes. So, you just need to initialize that array before you can even get started. Unless n is like much, much more than 4 billion and you have 16 gigabytes of storage just to throw away, which I don't even have any machines with 16 gigabytes of RAM, this is not such a great algorithm.
Just to get a feel, it's good, the numbers are really small. What we're going to do next is come up with a fancier algorithm that uses this as a subroutine on small numbers and combines this algorithm to handle larger numbers. That algorithm is called radix sort. But we need one important property of counting sort before we can go there. And that important property is stability. A stable sorting algorithm preserves the order of equal elements, let's say the relative order.
This is a bit subtle because usually we think of elements just as numbers. And, yeah, we had a couple threes and we had a couple fours. It turns out, if you look at the order of those threes and the order of those fours, we kept them in order. Because we took the last three and we put it here. Then we took the next to the last three and we put it to the left of that where O is decrementing our counter and moving from the end of the array to the beginning of the array. No matter how we do that, the orders of those threes are preserved, the orders of the fours are preserved. This may seem like a relatively simple thing, but if you look at the other four sorting algorithms we've seen, not all of them are stable. So, this is an exercise.
Exercise is figure out which other sorting algorithms that we've seen are stable and which are not. I encourage you to work that out because this is the sort of thing that we ask on quizzes. But for now all we need is that counting sort is stable. And I won't prove this, but it should be pretty obvious from the algorithm. Now we get to talk about radix sort. Radix sort is going to work for a much larger range of numbers in linear time.
Still it has to have an assumption about how big those numbers are, but it will be a much more lax assumption. Now, to increase suspense even further, I am going to tell you some history about radix sort. This is one of the oldest sorting algorithms. It's probably the oldest implemented sorting algorithm. It was implemented around 1890. This is Herman Hollerith. Let's say around 1890. Has anyone heard of Hollerith before? A couple people. Not too many. He is sort of an important guy. He was a lecturer at MIT at some point. He developed an early version of punch cards. Punch card technology. This is before my time so I even have to look at my notes to remember. Oh, yeah, they're called punch cards. You may have seen them.
If not they're in the PowerPoint lecture notes. There's this big grid. These days, if you've used a modern punch card recently, they are 80 characters wide and, I don't know, I think it's something like 16, I don't remember exactly. And then you punch little holes here. You have this magic machine. It's like a typewriter. You press a letter and that corresponds to some character. Maybe it will punch out a hole here, punch out a hole here. You can see the website if you want to know exactly how this works for historical reasons. You don't see these too often anymore, but this is in particular the reason why most terminals are 80 characters wide because that was how things were. Hollerith actually didn't develop these punch cards exactly, although eventually he did.
In the beginning, in 1890, the big deal was the US Census. If you watched the news, I guess like a year or two ago, the US Census was a big deal because it's really expensive to collect all this data from everyone. And the Constitution says you've got to collect data about everyone every ten years. And it was getting hard. In particular, in 1880, they did the census. And it took them almost ten years to complete the census. The population kept going up, and ten years to do a ten-year census, that's going to start getting expensive when they overlap with each other.
So, for 1890 they wanted to do something fancier. And Hollerith said, OK, I'm going to build a machine that you take in the data. It was a modified punch card where you would mark out particular squares depending on your status, whether you were single or married or whatever. All the things they wanted to know on the census they would encode in binary onto this card. And then he built a machine that would sort these cards so you could do counting.
And, in some sense, these are numbers. And the numbers aren't too big, but they're big enough that counting sort wouldn't work. I mean if there were a hundred numbers here, 2^100 is pretty overwhelming, so we cannot use counting sort. The first idea was the wrong idea. I'm going to think of these as numbers. Let's say each of these columns is one number. And so there's sort of the most significant number out here and there is the least significant number out here. The first idea was you sort by the most significant digit first.
That's not such a great algorithm, because if you sort by the most significant digit you get a bunch of buckets each with a pile of cards. And this was a physical device. It wasn't exactly an electronically controlled computer. It was a human that would push down some kind of reader. It would see which holes in the first column are punched. And then it would open a physical bin in which the person would sort of swipe it and it would just fall into the right bin. It was a semi-automated. I mean the computer was the human plus the machine, but never mind. This was the procedure. You sorted it into bins. Then you had to go through and sort each bin by the second digit.
And pretty soon the number of bins gets pretty big. And if you don't have too many digits this is OK, but it's not the right thing to do. The right idea, which is what Hollerith came up with after that, was to sort by the least significant digit first. And you should also do that using a stable sorting algorithm. Now, Hollerith probably didn't call it a stable sorting algorithm at the time, but we will. And this won Hollerith lots of money and good things. He founded this tabulating machine company in 1911, and that merged with several other companies to form something you may have heard of called IBM in 1924. That may be the context in which you've heard of Hollerith, or if you've done punch cards before.
The whole idea is that we're doing a digit by digit sort. I should have mentioned that at the beginning. And we're going to do it from least significant to most significant. It turns out that works. And to see that let's do an example. I think I'm going to need a whole two boards ideally. First we'll see an example. Then we'll prove the theorem. The proof is actually pretty darn easy. But, nonetheless, it's rather counterintuitive this works if you haven't seen it before. Certainly, the first time I saw it, it was quite a surprise. The nice thing also about this algorithm is there are no bins. It's all one big bin at all times. Let's take some numbers. 329. This is a three digit number. I'm spacing out the digits so we can see them a little bit better. 457.
657, 839, 436, 720 and 355. I'm assuming here we're using decimal numbers. Why not? Hopefully this are not yet sorted. We'd like to sort them. The first thing we do is take the least significant digit, sort by the least significant digit. And whenever we have equal elements like these two nines, we preserve their relative order. So, 329 is going to remain above 839. It doesn't matter here because we're doing the first sort, but in general we're always using a stable sorting algorithm. When we sort by this column, first we get the zero, so that's 720, then we get 5, 355.
Then we get 6, 436. Stop me if I make a mistake. Then we get the 7s, and we preserve the order. Here it happens to be the right order, but it may not be at this point. We haven't even looked at the other digits. Then we get 9s, there are two 9s, 329 and 839. All right so far? Good. Now we sort by the middle digit, the next least significant. And we start out with what looks like the 2s. There is a 2 up here and a 2 down here. Of course, we write the first 2 first, 720, then 329. Then we have the 3s, so we have 436 and 839.
Then we have a bunch of 5s it looks like. Have I missed anyone so far? No. Good. We have three 5s, 355, 457 and 657. I like to check that I haven't lost any elements. We have seven here, seven here and seven elements here. Good. Finally, we sort by the last digit. One thing to notice, by the way, is before we sorted by the last digit -- Currently these numbers don't resemble sorted order at all. But if you look at everything beyond the digit we haven't yet sorted, so these two digits, that's nice and sorted, 20, 29, 36, 39, 55, 57, 57. Pretty cool. Let's finish it off. We stably sort by the first digit. And the smallest number we get is a 3, so we get 329 and then 355.
Then we get some 4s, 436 and 457, then we get a 6, 657, then a 7, and then we have an 8. And check. I still have seven elements. Good. I haven't lost anyone. And, indeed, they're now in sorted order. And you can start to see why this is working. When I have equal elements here, I have already sorted the suffix. Let's write down a proof of that. What is nice about this algorithm is we're not partitioning into bins. We always keep the huge batch of elements in one big pile, but we're just going through it multiple times. In general, we sort of need to go through it multiple times. Hopefully not too many times.
But let's first argue correctness. To analyze the running time is a little bit tricky here because it depends how you partition into digits. Correctness is easy. We just induct on the digit position that we're currently sorting, so let's call that t. And we can assume by induction that it's sorted beyond digit t. This is our induction hypothesis. We assume that we're sorted on the low-order t - 1 digits. And then the next thing we do is sort on the t-th digit. We just need to check that things work. And we restore the induction hypothesis for t instead of t - 1.
When we sort on the t-th digit there are two cases. If we look at any two elements, we want to know whether they're put in the right order. If two elements are the same, let's say they have the same t-th digit -- This is the tricky case. If they have the same t-th digit then their order should not be changed. So, by stability, we know that they remain in the same order because stability is supposed to preserve things that have the same key that we're sorting on. And then, by the induction hypothesis, we know that that keeps them in sorted order because induction hypothesis says that they used to be sorted.
Adding on this value in the front that's the same in both doesn't change anything so they remain sorted. And if they have differing t-th digits -- -- then this sorting step will put them in the right order. Because that's what sorting does. This is the most significant digit, so you've got to order them by the t-th digit if they differ. The rest are irrelevant. So, proof here of correctness is very simple once you know the algorithm. Any questions before we go on? Good. We're going to use counting sort.
We could use any sorting algorithm we want for individual digits, but the only algorithm that we know that runs in less than n lg n time is counting sort. So, we better use that one to sort of bootstrap and get an even faster and more general algorithm. I just erased the running time. Counting sort runs in order k + n time. We need to remember that. And the range of the numbers is 1 to k or 0 to k - 1. When we sort by a particular digit, we shouldn't use n lg n algorithm because then this thing will take n lg n for one round and it's going to have multiple rounds. That's going to be worse than n lg n. We're going to use counting sort for each round.
We use counting sort for each digit. And we know the running time of counting sort here is order k + n . But I don't want to assume that my integers are split into digits for me. That's sort of giving away too much flexibility. Because if I have some number written in whatever form it is, probably written in binary, I can cluster together some of those bits and call that a digit. Let's think of our numbers as binary. Suppose we have n integers. And they're in some range. And we want to know how big a range they can be. Let's say, a sort of practical way of thinking, you know, we're in a binary world, each integer is b bits long.
So, in other words, the range is from 0 to 2b - 1. I will assume that my numbers are non-negative. It doesn't make much difference if they're negative, too. I want to know how big a b I can handle, but I don't want to split into bits as my digits because then I would have b digits and I would have to do b rounds of this algorithm. The number of rounds of this algorithm is the number of digits that I have. And each one costs me, let's hope, for linear time. And, indeed, if I use a single bit, k = 2. And so this is order n. But then the running time would be order n per round. And there are b digits, if I consider them to be bits, order n times b time. And even if b is something small like log n, if I have log n bits, then these are numbers between 0 and n - 1.
I already know how to sort numbers between 0 and n - 1 in linear time. Here I'm spending n lg n time, so that's no good. Instead, what we're going to do is take a bunch of bits and call that a digit, the most bits we can handle with counting sort. The notation will be I split each integer into b/r digits. Each r bits long. In other words, I think of my number as being in base 2^r. And I happen to be writing it down in binary, but I cluster together r bits and I get a bunch of digits in base 2^r. And then there are b/ r digits. This b/r is the number of rounds. And this base --
This is the maximum value I have in one of these digits. It's between 0 and 2^r. This is, in some sense, k for a run of counting sort. What is the running time? Well, I have b/r rounds. It's b/r times the running time for a round. Which I have n numbers and my value of k is 2^r. This is the running time of counting sort, n + k, this is the number of rounds, so this is b/r (n+2^r). And I am free to choose r however I want. What I would like to do is minimize this run time over my choices of r.
Any suggestions on how I might find the minimum running time over all choices of r? Techniques, not necessarily solutions. We're not used to this because it's asymptomatic, but forget the big O here. How do I minimize a function with respect to one variable? Take the derivative, yeah. I can take the derivative of this function by r, differentiate by r, set the derivative equal to 0, and that should be a critical point in this function. It turns out this function is unimodal in r and you will find the minimum. We could do that. I'm not going to do it because it takes a little bit more work. You should try it at home. It will give you the exact minimum, which is good if you know what this constant is.
Differentiate with respect to r and set to 0. I am going to do it a little bit more intuitively, in other words less precisely, but I will still get the right answer. And definitely I will get an upper bound because I can choose r to be whatever I want. It turns out this will be the right answer. Let's just think about growth in terms of r. There are essentially two terms here. I have b/r(n) and I have b/r(2^r). Now, b/r(n) would like r to be as big as possible. The bigger r is the number of rounds goes down. This number in front of n, this coefficient in front of n goes down, so I would like r to be big. So, b/r(n) wants r big. However, r cannot be too big. This is saying I want digits that have a lot of bits in them. It cannot be too big because there's 2^r term out here.
If this happens to be bigger than n then this will dominate in terms of growth of r. This is going to be b times 2 to the r over r. 2 the r is much, much bigger than r, so it's going to grow much faster is what I mean. And so I really don't want r to be too big for this other term. So, that is b/4(2^r) wants r small. Provided that this term is bigger or equal to this term then I can set r pretty big for that term. What I want is the n to dominate the 2^r. Provided I have that then I can set r as large as I want. Let's say I want to choose r to be maximum subject to this condition that n is greater than or equal to 2^r.
This is an upper bound to 2^r, and upper bound on r. In other words, I want r = lg n. This turns out to be the right answer up to constant factors. There we go. And definitely choosing r to be lg n will give me an upper bound on the best running time I could get because I can choose it to be whatever I want. If you differentiate you will indeed get the same answer. This was not quite a formal argument but close, because the big O is all about what grows fastest. If we plug in r = lg n we get bn/lg n. The n and the 2^r are equal, that's a factor of 2, 2 times n, not a big deal.
It comes out into the O. We have bn/lg n which is r. We have to think about what this means and translate it in terms of range. b was the number of bits in our number, which corresponds to the range of the number. I've got 20 minutes under so far in lecture so I can go 20 minutes over, right? No, I'm kidding. Almost done. Let's say that our numbers, are integers are in the range, we have 0 to 2^b, I'm going to say that it's range 0 to nd. This should be a -1 here.
If I have numbers that are between 0 and n^d - 1 where d is a constant or d is some parameter, so this is a polynomial in n, then you work out this running time. It is order dn. This is the way to think about it because now we can compare to counting sort. Counting sort could handle 0 up to some constant times d in linear time. Now I can handle 0 up to n to some constant power in linear time. This is if d = order 1 then we get a linear time sorting algorithm. And that is cool as long as d is at most lg n. As long as your numbers are at most n lg n then we have something that beats our n lg n sorting algorithms. And this is pretty nice.
Whenever you know that your numbers are order log end bits long we are happy, and you get some smooth tradeoff there. For example, if we have our 32 bit numbers and we split into let's say eight bit chunks then we'll only have to do four rounds each linear time and we have just 256 working space. We were doing four rounds for 32 bit numbers. If you use n lg n algorithm, you're going to be doing lg n rounds through your numbers.
n is like 2000, and that's at least 11 rounds for example. You would think this algorithm is going to be much faster for small numbers. Unfortunately, counting sort is not very good on a cache. In practice, rating sort is not that fast an algorithm unless your numbers are really small. Something like quicksort can do better. It's sort of shame, but theoretically this is very beautiful. And there are contexts where this is really the right way to sort things.
I will mention finally that if you have arbitrary integers that are one word length long. Here we're assuming that there are b bits in a word and we have some depends indirectly on b here. But, in general, if you have a bunch of integers and they're one word length long, and you can manipulate a word in constant time, then the best algorithm we know for sorting runs in n times square root of lg lg n time expected.
It is a randomized algorithm. We're not going to cover that algorithm in this class. It's rather complicated. I didn't even cover it in Advanced Algorithms when I taught it. If you want something easier, you can get n times square root of lg lg n time worst-case. And that paper is almost readable. I have taught that in Advanced Algorithms. If you're interested in this kind of stuff, take Advanced Algorithms next fall.
It's one of the follow-ons to this class. These are much more complicated algorithms, but it gives you some sense. You can even break out of the dependence on b, as long as you know that b is at most a word. And I will stop there unless there are any questions. Then see you Wednesday.
This is one of over 2,200 courses on OCW. Find materials for this course in the pages linked along the left.
MIT OpenCourseWare is a free & open publication of material from thousands of MIT courses, covering the entire MIT curriculum.
No enrollment or registration. Freely browse and use OCW materials at your own pace. There's no signup, and no start or end dates.
Knowledge is your reward. Use OCW to guide your own life-long learning, or to teach others. We don't offer credit or certification for using OCW.
Made for sharing. Download files for later. Send to friends and colleagues. Modify, remix, and reuse (just remember to cite OCW as the source.)
Learn more at Get Started with MIT OpenCourseWare