1 00:00:00,000 --> 00:00:01,976 [SQUEAKING] 2 00:00:01,976 --> 00:00:04,446 [RUSTLING] 3 00:00:04,446 --> 00:00:07,904 [CLICKING] 4 00:00:12,850 --> 00:00:16,580 JASON KU: Good morning, everybody. 5 00:00:16,580 --> 00:00:18,400 How's everybody doing? 6 00:00:18,400 --> 00:00:20,920 Nice long weekend we just came from-- 7 00:00:20,920 --> 00:00:21,820 I'm doing well. 8 00:00:21,820 --> 00:00:24,870 I'm actually getting over a little cold. 9 00:00:24,870 --> 00:00:27,220 Aw-- yeah, unfortunately. 10 00:00:27,220 --> 00:00:30,850 But after this, I don't have anything else this week, 11 00:00:30,850 --> 00:00:32,470 so that's good. 12 00:00:32,470 --> 00:00:40,390 OK, so last time, last week, we talked about how-- 13 00:00:40,390 --> 00:00:45,550 we looked at the search problem that we talked about 14 00:00:45,550 --> 00:00:49,070 earlier that week and showed that, 15 00:00:49,070 --> 00:00:51,200 in a certain model of computation, 16 00:00:51,200 --> 00:00:57,080 where I could only compare two objects that I'm 17 00:00:57,080 --> 00:00:58,820 storing in my-- 18 00:00:58,820 --> 00:01:02,480 that I'm storing and get some constant number 19 00:01:02,480 --> 00:01:04,489 of outputs on what I could-- 20 00:01:04,489 --> 00:01:06,140 how I could identify these things, 21 00:01:06,140 --> 00:01:08,180 like equal, or less than, or something 22 00:01:08,180 --> 00:01:11,070 like that, then we drew a decision tree 23 00:01:11,070 --> 00:01:17,430 and we got this bound that, if I had n outputs, 24 00:01:17,430 --> 00:01:21,760 I would require my decision tree to be at least log n height. 25 00:01:21,760 --> 00:01:27,000 And so in this model, I can't find the things faster 26 00:01:27,000 --> 00:01:28,810 than log n time. 27 00:01:28,810 --> 00:01:33,620 But luckily, we are in a model of computation which 28 00:01:33,620 --> 00:01:35,690 has a stronger operation-- 29 00:01:35,690 --> 00:01:38,680 namely, random accessing. 
30 00:01:38,680 --> 00:01:43,390 And if we stored the things that we're looking for, 31 00:01:43,390 --> 00:01:47,710 we have unique keys, and those keys are integers. 32 00:01:47,710 --> 00:01:52,020 Then, if I have an item with key K, 33 00:01:52,020 --> 00:01:57,540 if I store it at index K in my array, 34 00:01:57,540 --> 00:02:02,220 then I can find it and manipulate it in constant time. 35 00:02:02,220 --> 00:02:04,667 That's pretty cool. 36 00:02:04,667 --> 00:02:06,500 That's what we called a direct access array. 37 00:02:06,500 --> 00:02:08,333 A direct access array-- really not different 38 00:02:08,333 --> 00:02:11,000 than a regular array, except how are you using it 39 00:02:11,000 --> 00:02:13,550 when we were talking about sequences is we 40 00:02:13,550 --> 00:02:19,910 are giving extrinsic semantics to the slots where 41 00:02:19,910 --> 00:02:21,110 we are storing these things. 42 00:02:21,110 --> 00:02:26,120 Basically, I could put any item in any slot. 43 00:02:26,120 --> 00:02:28,160 Where it was in my array had nothing 44 00:02:28,160 --> 00:02:30,590 to do with what those things were. 45 00:02:30,590 --> 00:02:35,900 Here we are imposing intrinsic semantics on my array 46 00:02:35,900 --> 00:02:42,980 that, if I have an item with key K, it must be at index K. 47 00:02:42,980 --> 00:02:46,970 That's the thing that we're taking advantage of here. 48 00:02:46,970 --> 00:02:50,540 And then we can use this nice, powerful linear branching 49 00:02:50,540 --> 00:02:54,183 random access operation to find that thing in constant time, 50 00:02:54,183 --> 00:02:55,850 because that's our model of computation. 51 00:02:55,850 --> 00:03:01,160 OK, then what was the problem with this direct access array? 52 00:03:01,160 --> 00:03:02,120 Anyone shout it out. 53 00:03:06,180 --> 00:03:07,800 Space-- right. 
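A minimal sketch of the direct access array described above, assuming items expose an integer `key` attribute in the range 0 to u - 1. The names `Item` and `DirectAccessArray` are illustrative, not from the lecture.

```python
from collections import namedtuple

# Hypothetical item type: any object with an integer .key would work.
Item = namedtuple("Item", ["key", "value"])


class DirectAccessArray:
    """Stores an item with key k at index k (keys assumed unique, 0 <= k < u)."""

    def __init__(self, u):
        # One slot per possible key: this is the O(u) space cost.
        self.slots = [None] * u

    def insert(self, item):
        # A single random access, so worst-case constant time.
        self.slots[item.key] = item

    def find(self, key):
        # Returns the stored item, or None if no item has this key.
        return self.slots[key]

    def delete(self, key):
        self.slots[key] = None
```

Every operation is one index lookup, but the array is as large as the key space even when only a few items are stored.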
54 00:03:07,800 --> 00:03:10,980 So we had to instantiate a direct access 55 00:03:10,980 --> 00:03:15,180 array that was the size of the space of our keys. 56 00:03:15,180 --> 00:03:18,930 In general, my index location is-- 57 00:03:18,930 --> 00:03:21,520 could go from 0 to some positive number. 58 00:03:21,520 --> 00:03:24,270 If I had very large positive numbers, if I was sorting-- 59 00:03:24,270 --> 00:03:27,420 if I was searching among your MIT IDs, 60 00:03:27,420 --> 00:03:29,280 I'd have to have a direct access array that 61 00:03:29,280 --> 00:03:33,450 spanned that space of possible keys you could have. 62 00:03:33,450 --> 00:03:35,850 And that could be much larger than n. 63 00:03:35,850 --> 00:03:39,510 And so the rest of the time we talked about how 64 00:03:39,510 --> 00:03:42,570 to fix that space problem. 65 00:03:42,570 --> 00:03:46,500 We can reduce the space by taking that larger key space 66 00:03:46,500 --> 00:03:49,290 from 0 to u, which could be very large, 67 00:03:49,290 --> 00:03:51,690 and map it down to a small space. 68 00:03:51,690 --> 00:03:56,740 Now, in general, if I give you a fixed hash function there, 69 00:03:56,740 --> 00:04:00,440 that's not going to be good in-- for all inputs. 70 00:04:00,440 --> 00:04:04,160 If your inputs are very well distributed over the key space, 71 00:04:04,160 --> 00:04:07,670 then it is good, but in general, there 72 00:04:07,670 --> 00:04:12,680 would be hash functions with some inputs that will be bad. 73 00:04:12,680 --> 00:04:14,720 That's what we argued. 
74 00:04:14,720 --> 00:04:16,820 And so for the rest of the time there, 75 00:04:16,820 --> 00:04:20,390 we talked about hash families, choosing a hash function 76 00:04:20,390 --> 00:04:24,770 randomly from among a large set of hash functions, 77 00:04:24,770 --> 00:04:28,850 which had a property that, if I chose this thing randomly 78 00:04:28,850 --> 00:04:32,660 and you, generating your input, didn't know which random 79 00:04:32,660 --> 00:04:38,170 numbers I was picking, the expectation over my random 80 00:04:38,170 --> 00:04:39,003 choice-- me-- 81 00:04:39,003 --> 00:04:40,420 I'm the one running the algorithm, 82 00:04:40,420 --> 00:04:43,800 not you giving me the input-- 83 00:04:43,800 --> 00:04:46,230 that random choice-- my algorithm 84 00:04:46,230 --> 00:04:48,360 actually behaves really well in expectation. 85 00:04:48,360 --> 00:04:50,760 In particular, I got constant time 86 00:04:50,760 --> 00:04:55,470 for finding, inserting, and deleting into this data 87 00:04:55,470 --> 00:04:57,390 structure, in expectation. 88 00:04:57,390 --> 00:05:00,930 We did a little proof of-- 89 00:05:00,930 --> 00:05:04,500 that the chain links where we stored collisions in our hash 90 00:05:04,500 --> 00:05:06,120 function-- in our hash table-- 91 00:05:06,120 --> 00:05:10,750 sorry-- those wouldn't be very long, 92 00:05:10,750 --> 00:05:13,500 and so if they were constant, then I 93 00:05:13,500 --> 00:05:16,560 don't have to search more than a constant number of things 94 00:05:16,560 --> 00:05:19,710 when I go to an-- a hashed index location. 95 00:05:19,710 --> 00:05:23,070 Does everyone remember what we talked about last week? 96 00:05:26,930 --> 00:05:29,120 I didn't show you this chart at the end, 97 00:05:29,120 --> 00:05:31,370 but I'm showing it to you now. 98 00:05:31,370 --> 00:05:34,430 Essentially, what we had was we have a bunch of different ways 99 00:05:34,430 --> 00:05:37,550 to deal with this set interface. 
100 00:05:37,550 --> 00:05:40,850 And last week, we talked about the sorted array, 101 00:05:40,850 --> 00:05:43,490 and then we talked about this direct access array and this 102 00:05:43,490 --> 00:05:52,300 hash table, which do better for these dictionary-- the find, 103 00:05:52,300 --> 00:05:55,090 and insert, and delete operations-- 104 00:05:55,090 --> 00:05:58,750 or at least better in an expected sense. 105 00:05:58,750 --> 00:06:02,080 What's the worst case performance of a hash table? 106 00:06:06,290 --> 00:06:08,840 If I have to look up something in a hash table, 107 00:06:08,840 --> 00:06:12,200 and I happen to choose a bad hash table-- hash function, 108 00:06:12,200 --> 00:06:14,420 what's the worst case here? 109 00:06:14,420 --> 00:06:15,560 What? 110 00:06:15,560 --> 00:06:16,550 n, right? 111 00:06:16,550 --> 00:06:20,300 It's worse than a sorted array, because potentially, I 112 00:06:20,300 --> 00:06:22,280 hashed everything that I was storing 113 00:06:22,280 --> 00:06:24,860 to the same index in my hash table, 114 00:06:24,860 --> 00:06:27,380 and to be able to distinguish between them, 115 00:06:27,380 --> 00:06:30,860 I can't do anything more than a linear search. 116 00:06:30,860 --> 00:06:35,810 I could store another set's data structure as my chain 117 00:06:35,810 --> 00:06:37,430 and do better that way. 118 00:06:37,430 --> 00:06:39,680 That's actually how Java does it. 119 00:06:39,680 --> 00:06:42,140 They store a data structure we're 120 00:06:42,140 --> 00:06:45,350 going to be talking about next week as the chains 121 00:06:45,350 --> 00:06:48,170 so that they can get, worst case, log n. 
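A rough sketch of a hash table with chaining, to make the expected versus worst-case distinction concrete. The hash family used here, h(k) = ((a*k + b) mod p) mod m with random a and b, is an assumption for illustration (this part of the transcript does not name a family), as are the class name and the default sizes. Resizing is left out.

```python
import random


class ChainedHashTable:
    """Illustrative chained hash table for non-negative integer keys below p."""

    def __init__(self, m=8, p=2**31 - 1):
        self.m = m                        # number of chains (table size)
        self.p = p                        # prime assumed larger than any key
        self.a = random.randrange(1, p)   # random choice made by the algorithm,
        self.b = random.randrange(0, p)   # hidden from whoever supplies the input
        self.chains = [[] for _ in range(m)]

    def _hash(self, key):
        return ((self.a * key + self.b) % self.p) % self.m

    def insert(self, key, value):
        chain = self.chains[self._hash(key)]
        # Scan the chain so an existing key is overwritten, not duplicated.
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def find(self, key):
        # Linear scan of one chain: constant expected length when m is
        # proportional to n, but all n items if every key landed in it.
        for k, v in self.chains[self._hash(key)]:
            if k == key:
                return v
        return None
```

If every stored key happens to hash to the same index, `find` degrades to a linear scan of all n items, which is the worst case discussed above.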
122 00:06:48,170 --> 00:06:54,138 But in general, that hash table is only good 123 00:06:54,138 --> 00:06:54,930 if we're allowing-- 124 00:06:54,930 --> 00:06:58,410 OK, I want this to be expected good, but in the worst case, 125 00:06:58,410 --> 00:07:01,080 if I really need that operation to be worst case-- 126 00:07:01,080 --> 00:07:03,780 I really can't afford linear time ever 127 00:07:03,780 --> 00:07:05,467 for an operation of that kind-- 128 00:07:05,467 --> 00:07:07,050 then I don't want to use a hash table. 129 00:07:07,050 --> 00:07:10,560 And so on your p set 2, everything we ask you for 130 00:07:10,560 --> 00:07:12,570 is worst case, so probably, you don't 131 00:07:12,570 --> 00:07:14,800 want to be using hash tables. 132 00:07:14,800 --> 00:07:16,300 OK? 133 00:07:16,300 --> 00:07:17,117 Yes? 134 00:07:17,117 --> 00:07:18,310 AUDIENCE: What does the subscript e mean? 135 00:07:18,310 --> 00:07:19,570 JASON KU: What does the subscript e mean? 136 00:07:19,570 --> 00:07:20,200 That's great. 137 00:07:20,200 --> 00:07:28,840 In this chart, I put a subscript on this is an expected runtime, 138 00:07:28,840 --> 00:07:31,510 or an A meaning this is an amortized runtime. 
139 00:07:31,510 --> 00:07:35,290 At the end, we talked about how, if we had too many things 140 00:07:35,290 --> 00:07:40,870 in our hash table, then, as long as we didn't do it too often-- 141 00:07:40,870 --> 00:07:42,460 this is a little hand wavey argument, 142 00:07:42,460 --> 00:07:45,490 but the same kinds of ideas as the dynamic array-- 143 00:07:45,490 --> 00:07:48,720 if, whenever we got a linear-- 144 00:07:48,720 --> 00:07:53,970 we are more than a linear factor away from where we are trying-- 145 00:07:53,970 --> 00:07:56,580 basically, the fill factor we were trying to be, 146 00:07:56,580 --> 00:07:58,950 then we could just completely rebuild the hash table 147 00:07:58,950 --> 00:08:00,780 with the new hash function randomly 148 00:08:00,780 --> 00:08:03,920 chosen from our hash table with a new size, 149 00:08:03,920 --> 00:08:05,690 and we could get amortized bounds. 150 00:08:05,690 --> 00:08:07,220 And so that's what Python-- 151 00:08:07,220 --> 00:08:10,820 how Python implements dictionaries, or sets, or even 152 00:08:10,820 --> 00:08:17,500 objects when it's trying to map keys to different things. 153 00:08:17,500 --> 00:08:19,050 So that's hash tables. 154 00:08:19,050 --> 00:08:20,310 That's great. 155 00:08:20,310 --> 00:08:25,350 The key thing here is, well, actually, if your range of keys 156 00:08:25,350 --> 00:08:28,770 is small, or if you as a programmer 157 00:08:28,770 --> 00:08:31,920 have the ability to choose the keys that you identify 158 00:08:31,920 --> 00:08:33,809 your objects with, you can actually 159 00:08:33,809 --> 00:08:36,270 choose that range to be small, to be linear, 160 00:08:36,270 --> 00:08:39,107 to be small with respect to your items. 161 00:08:39,107 --> 00:08:40,440 And you don't need a hash table. 162 00:08:40,440 --> 00:08:43,230 You can just use a direct access array, 163 00:08:43,230 --> 00:08:48,580 because if you know your key space is small, that's great. 
164 00:08:48,580 --> 00:08:50,097 So a lot of C programmers probably 165 00:08:50,097 --> 00:08:52,180 would like to do something like that, because they 166 00:08:52,180 --> 00:08:53,470 don't have access to-- 167 00:08:53,470 --> 00:08:58,760 maybe C++ programmers would have access to their hash table. 168 00:08:58,760 --> 00:09:02,370 Any questions on this stuff before we move on? 169 00:09:02,370 --> 00:09:03,091 Yeah? 170 00:09:03,091 --> 00:09:07,240 AUDIENCE: So why is [INAUDIBLE]? 171 00:09:07,240 --> 00:09:09,040 JASON KU: Why is it expected? 172 00:09:09,040 --> 00:09:12,610 When I'm building, I could insert-- 173 00:09:12,610 --> 00:09:16,720 I'm inserting these things from x 1 by 1 into my hash table. 174 00:09:16,720 --> 00:09:19,840 Each of those insert operations-- 175 00:09:19,840 --> 00:09:22,480 I'm looking up to see whether that-- 176 00:09:22,480 --> 00:09:25,460 an item with that key already exists in my hash table. 177 00:09:25,460 --> 00:09:29,120 And so I have to look down the chain to see where it is. 178 00:09:29,120 --> 00:09:33,010 However, if I happen to know that all of my keys 179 00:09:33,010 --> 00:09:35,960 are unique in my input, all the items I'm trying to store 180 00:09:35,960 --> 00:09:38,380 are unique, then I don't have to do that check 181 00:09:38,380 --> 00:09:40,270 and I can get worst case linear time. 182 00:09:40,270 --> 00:09:42,160 Does that make sense? 183 00:09:42,160 --> 00:09:42,680 All right. 184 00:09:42,680 --> 00:09:45,360 It's a subtlety, but that's a great question. 185 00:09:45,360 --> 00:09:49,280 OK, so today, instead of talking about searching, 186 00:09:49,280 --> 00:09:52,030 we're talking about sorting. 187 00:09:52,030 --> 00:09:57,360 Last week, we saw a few ways to do sort. 188 00:09:57,360 --> 00:09:59,940 Some of them were quadratic-- insertion sort and selection 189 00:09:59,940 --> 00:10:00,450 sort-- 190 00:10:00,450 --> 00:10:02,340 and then we had one that was n log n. 
191 00:10:02,340 --> 00:10:06,330 And this thing, n log n, seemed pretty good, 192 00:10:06,330 --> 00:10:08,160 but can I do better? 193 00:10:11,070 --> 00:10:12,788 Can I do better? 194 00:10:12,788 --> 00:10:15,330 Well, what we're going to show at the beginning of this class 195 00:10:15,330 --> 00:10:19,640 is, in this comparison model, no. 196 00:10:19,640 --> 00:10:20,990 n log n is optimal. 197 00:10:20,990 --> 00:10:23,210 And we're going to go through the exact same line 198 00:10:23,210 --> 00:10:26,430 of reasoning that we had last week. 199 00:10:26,430 --> 00:10:35,340 So in the comparison model, what did we 200 00:10:35,340 --> 00:10:40,020 use when we were trying to make this argument 201 00:10:40,020 --> 00:10:45,830 that any comparison model algorithm was going 202 00:10:45,830 --> 00:10:48,470 to take at least log n time? 203 00:10:48,470 --> 00:10:50,690 What we did was we said, OK, I can 204 00:10:50,690 --> 00:10:54,080 think of any model in the comparison model-- 205 00:10:54,080 --> 00:10:59,060 any algorithm in the comparison model as kind of this-- 206 00:10:59,060 --> 00:11:01,250 some comparisons happen. 207 00:11:01,250 --> 00:11:03,950 They branch in a binary sense, but you 208 00:11:03,950 --> 00:11:07,260 could have it generalized to any constant branching factor. 209 00:11:07,260 --> 00:11:10,590 But for our purposes, binary's fine. 210 00:11:10,590 --> 00:11:16,040 And what we said was that there were at least n outputs-- 211 00:11:16,040 --> 00:11:17,660 really n plus 1, but-- 212 00:11:17,660 --> 00:11:20,270 at least order n outputs. 213 00:11:20,270 --> 00:11:22,520 And we showed that-- 214 00:11:22,520 --> 00:11:25,430 or we argued to you that the height of this tree 215 00:11:25,430 --> 00:11:29,210 had to be at least log n-- 216 00:11:31,880 --> 00:11:34,130 log the number of leaves. 217 00:11:34,130 --> 00:11:36,320 It had to be at least log the number of leaves. 
218 00:11:36,320 --> 00:11:38,840 That was the height of the decision tree. 219 00:11:38,840 --> 00:11:44,120 And if this decision tree represented a search algorithm, 220 00:11:44,120 --> 00:11:48,110 I had to walk down and perform these comparisons 221 00:11:48,110 --> 00:11:53,930 in order, reach a leaf where I would output something. 222 00:11:53,930 --> 00:12:00,050 If the minimum height of any binary tree on a linear number 223 00:12:00,050 --> 00:12:05,030 of leaves is log n, then any algorithm 224 00:12:05,030 --> 00:12:10,570 in the comparison model also has to take log n time, 225 00:12:10,570 --> 00:12:13,090 because it has to do that many comparisons to differentiate 226 00:12:13,090 --> 00:12:16,150 between all possible outputs. 227 00:12:16,150 --> 00:12:18,490 Does that make sense? 228 00:12:18,490 --> 00:12:19,180 All right. 229 00:12:19,180 --> 00:12:28,070 So in the sort problem, how many possible outputs are there? 230 00:12:31,220 --> 00:12:33,405 What is the output of a sorting algorithm? 231 00:12:38,740 --> 00:12:39,710 AUDIENCE: [INAUDIBLE] 232 00:12:39,710 --> 00:12:40,916 JASON KU: What? 233 00:12:40,916 --> 00:12:42,940 What's up? 234 00:12:42,940 --> 00:12:48,080 A list-- in particular, given my input-- 235 00:12:48,080 --> 00:12:56,330 some set of items A that has size n-- 236 00:12:56,330 --> 00:13:00,650 what I'm going to give you is some permutation of that list. 237 00:13:00,650 --> 00:13:06,310 So for each index, say, I could tell you where it goes. 238 00:13:09,190 --> 00:13:13,060 Another way I could say is, where does the first item 239 00:13:13,060 --> 00:13:16,570 go to, where does the second item go 240 00:13:16,570 --> 00:13:18,690 to, where does the third item go to-- blah, blah, 241 00:13:18,690 --> 00:13:20,970 blah-- like that. 242 00:13:20,970 --> 00:13:25,210 So how many different choices of a permutation are there? 
243 00:13:25,210 --> 00:13:29,130 Well, how many choices do I have for the first thing of where 244 00:13:29,130 --> 00:13:31,740 it could be in the final sorted array? 245 00:13:31,740 --> 00:13:36,790 It could be in any of the places, so it's n. 246 00:13:36,790 --> 00:13:38,830 How about this one, the second one? 247 00:13:38,830 --> 00:13:41,470 Well, it can't go to where this one went, 248 00:13:41,470 --> 00:13:43,430 right, but it can go anywhere else. 249 00:13:43,430 --> 00:13:45,280 So it's n minus 1. 250 00:13:45,280 --> 00:13:47,600 And since these are independent choices I'm making, 251 00:13:47,600 --> 00:13:49,240 if I multiply them all together, I 252 00:13:49,240 --> 00:13:52,030 get n factorial permutations that 253 00:13:52,030 --> 00:13:53,830 are the number of possible outputs 254 00:13:53,830 --> 00:13:55,780 that I have to my sorting algorithm. 255 00:13:55,780 --> 00:13:58,870 So for me, to have an output to my sorting algorithm 256 00:13:58,870 --> 00:14:01,990 be correct, I need at least n factorial leaves. 257 00:14:01,990 --> 00:14:03,730 Does that make sense? 258 00:14:03,730 --> 00:14:04,230 OK. 259 00:14:07,260 --> 00:14:10,170 The nice thing about doing this last week 260 00:14:10,170 --> 00:14:14,010 is this is really just the number of leaves 261 00:14:14,010 --> 00:14:16,740 and this is really the number of leaves. 262 00:14:16,740 --> 00:14:21,090 So what's the number of leaves is theta n factorial. 263 00:14:21,090 --> 00:14:23,370 Here it's actually n factorial, but I'm just 264 00:14:23,370 --> 00:14:25,410 going to put it there. 265 00:14:25,410 --> 00:14:27,690 And here we get an n factorial. 266 00:14:33,460 --> 00:14:34,030 I see. 267 00:14:34,030 --> 00:14:39,640 So it's at least omega n factorial. 268 00:14:39,640 --> 00:14:42,320 Does that make you happier? 269 00:14:42,320 --> 00:14:44,120 Theta here-- thank you-- 270 00:14:44,120 --> 00:14:47,040 has to be at least. 271 00:14:47,040 --> 00:14:47,790 So this was right. 
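The counting argument above (n choices for the first item, n - 1 for the second, and so on) can be checked by brute force for small n. This is only an illustrative sanity check; the function names are made up for this sketch.

```python
from itertools import permutations


def count_orderings(n):
    """Count the distinct orderings of n distinct items by enumerating them."""
    return sum(1 for _ in permutations(range(n)))


def product_of_choices(n):
    """Multiply the choices n * (n - 1) * ... * 1 from the counting argument."""
    total = 1
    for choices in range(n, 0, -1):
        total *= choices
    return total
```

For small n the two functions agree, which is the n factorial count of possible outputs that the decision tree must have as leaves.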
272 00:14:50,580 --> 00:14:53,820 OK, so at least this many-- 273 00:14:53,820 --> 00:14:57,480 there are algorithms that, if it got-- 274 00:14:57,480 --> 00:14:59,100 it could take two different routes 275 00:14:59,100 --> 00:15:00,900 to get to the same output. 276 00:15:00,900 --> 00:15:03,970 So this is a lower bound on the number of leaves. 277 00:15:03,970 --> 00:15:05,070 OK? 278 00:15:05,070 --> 00:15:06,690 So what this argument is saying is 279 00:15:06,690 --> 00:15:10,110 that, if I just replace the number of leaves n here 280 00:15:10,110 --> 00:15:13,500 with n factorial, I get a similar comparison 281 00:15:13,500 --> 00:15:15,660 sort lower bound now. 282 00:15:15,660 --> 00:15:17,790 So what is log of n factorial? 283 00:15:20,980 --> 00:15:24,610 This is familiar from p set 1 maybe. 284 00:15:24,610 --> 00:15:28,820 So one thing I could do is I could put in Stirling's formula, 285 00:15:28,820 --> 00:15:29,320 right? 286 00:15:32,690 --> 00:15:35,370 And that'll give me something of the form n log n. 287 00:15:35,370 --> 00:15:41,180 But what's another way I could lower bound n factorial? 288 00:15:41,180 --> 00:15:42,850 Well, I have a bunch of things here. 289 00:15:46,030 --> 00:15:48,280 That's n factorial. 290 00:15:48,280 --> 00:15:50,480 Half of these things-- 291 00:15:50,480 --> 00:15:53,980 these half, n/2 things-- 292 00:15:53,980 --> 00:15:58,770 are bigger than or equal to n/2. 293 00:15:58,770 --> 00:16:00,600 That make sense? 294 00:16:00,600 --> 00:16:08,930 So I can certainly lower bound this thing by n/2 to the n/2. 295 00:16:08,930 --> 00:16:11,780 That's a little easier thing to take a log of. 296 00:16:11,780 --> 00:16:17,250 If you take a log of that, that's asymptotically n log n. 297 00:16:17,250 --> 00:16:20,850 So what we're getting here is any sorting algorithm here 298 00:16:20,850 --> 00:16:24,270 takes at least n log n comparisons, 299 00:16:24,270 --> 00:16:26,100 and so a merge sort's the best we can do. 
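Written out, the lower bound sketched on the board looks roughly like this, assuming n is at least 2 and using base-2 logs. Only the largest terms of the product are kept, and each of them is at least n/2.

```latex
\begin{align*}
n! &= \prod_{i=1}^{n} i
    \;\ge\; \prod_{i=\lceil n/2 \rceil}^{n} i
    \;\ge\; \left(\frac{n}{2}\right)^{n/2}, \\
\log_2 (n!) &\ge \frac{n}{2}\,\log_2\!\left(\frac{n}{2}\right)
    = \frac{n}{2}\left(\log_2 n - 1\right)
    = \Omega(n \log n).
\end{align*}
```

The second product has at least n/2 factors, each at least n/2, which is what justifies the last inequality on the first line.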
300 00:16:28,950 --> 00:16:30,480 That make sense to everybody? 301 00:16:30,480 --> 00:16:33,600 We're just piggybacking on the analysis we had about decision 302 00:16:33,600 --> 00:16:38,800 trees, connecting leaves with the minimum height 303 00:16:38,800 --> 00:16:43,090 of any binary tree on that number of leaves, 304 00:16:43,090 --> 00:16:46,090 and just replacing n with n factorial-- 305 00:16:46,090 --> 00:16:48,650 nothing super interesting here. 306 00:16:48,650 --> 00:16:49,150 Yeah? 307 00:16:49,150 --> 00:16:51,450 AUDIENCE: [INAUDIBLE] the n over 2. 308 00:16:51,450 --> 00:16:53,047 JASON KU: Yeah, sure. 309 00:16:53,047 --> 00:16:54,630 You can just plug in Stirling's formula, 310 00:16:54,630 --> 00:16:58,950 but I did this, so I might as well clarify. 311 00:16:58,950 --> 00:17:03,320 There are n terms here in the product. 312 00:17:03,320 --> 00:17:06,214 Half of them are at least n/2. 313 00:17:06,214 --> 00:17:07,089 Does that make sense? 314 00:17:09,660 --> 00:17:12,300 I can lower bound this product by something 315 00:17:12,300 --> 00:17:15,240 smaller than half of the terms-- 316 00:17:15,240 --> 00:17:18,119 a product of that, and that'll be fine. 317 00:17:18,119 --> 00:17:23,819 So I'm taking n/2 of them and I'm multiplying n/2 altogether, 318 00:17:23,819 --> 00:17:24,825 n/2 times. 319 00:17:24,825 --> 00:17:25,700 Does that make sense? 320 00:17:28,650 --> 00:17:31,250 It's just providing a lower bound. 321 00:17:31,250 --> 00:17:33,200 I just need something that's smaller 322 00:17:33,200 --> 00:17:34,460 than all of these terms. 323 00:17:34,460 --> 00:17:36,260 And multiply them all together, and that'll 324 00:17:36,260 --> 00:17:39,120 give me a lower bound. 325 00:17:39,120 --> 00:17:43,210 OK, so we can't do better than n log n in the comparison model, 326 00:17:43,210 --> 00:17:46,650 but what we did last week was use 327 00:17:46,650 --> 00:17:49,500 random access and a direct access array to do better. 
328 00:17:49,500 --> 00:17:51,750 OK? 329 00:17:51,750 --> 00:17:57,170 Can anyone think of how to use that idea to sort faster? 330 00:17:57,170 --> 00:18:00,320 And I'm going to give you a caveat here. 331 00:18:00,320 --> 00:18:04,400 I'm going to let you assume that the keys of the things you're 332 00:18:04,400 --> 00:18:05,960 trying to sort out are unique. 333 00:18:08,850 --> 00:18:13,760 And say they're in a bound-- in a small range. 334 00:18:13,760 --> 00:18:19,610 So how could I use a direct access array to sort faster? 335 00:18:19,610 --> 00:18:21,260 Any ideas? 336 00:18:21,260 --> 00:18:21,760 Yeah? 337 00:18:21,760 --> 00:18:23,732 AUDIENCE: Could you just literally 338 00:18:23,732 --> 00:18:26,267 insert [INAUDIBLE] into a direct access array? 339 00:18:26,267 --> 00:18:26,975 JASON KU: Uh-huh. 340 00:18:26,975 --> 00:18:31,410 AUDIENCE: And then you look at that array and how to sort it. 341 00:18:31,410 --> 00:18:31,980 JASON KU: OK. 342 00:18:31,980 --> 00:18:34,230 So what your colleague is saying is exactly correct. 343 00:18:34,230 --> 00:18:37,620 It's something that I like to call direct access array sort. 344 00:18:37,620 --> 00:18:41,370 We won't really call it that, because there's something more 345 00:18:41,370 --> 00:18:45,190 general that we'll talk about in just a second. 346 00:18:45,190 --> 00:18:47,340 But what your colleague was saying is, 347 00:18:47,340 --> 00:18:50,610 instantiate a big direct access array-- 348 00:18:50,610 --> 00:18:53,805 direct access array sort. 
349 00:18:56,570 --> 00:18:59,510 I'm instantiating this big direct access 350 00:18:59,510 --> 00:19:03,910 array of the space of my keys, and what 351 00:19:03,910 --> 00:19:05,410 your colleague was saying was I take 352 00:19:05,410 --> 00:19:08,812 each one of the items in my-- 353 00:19:08,812 --> 00:19:10,270 the things that I'm trying to sort, 354 00:19:10,270 --> 00:19:11,980 I look at each one of their keys, 355 00:19:11,980 --> 00:19:16,240 and I stick it in the direct access array 356 00:19:16,240 --> 00:19:20,740 exactly where it needs to go, in constant time. 357 00:19:20,740 --> 00:19:22,150 That's great. 358 00:19:22,150 --> 00:19:25,150 Now, I gave you this caveat that all the keys were unique, 359 00:19:25,150 --> 00:19:27,580 so I don't have to deal with collisions here. 360 00:19:27,580 --> 00:19:30,430 But then, after I'm done with this, all of these things 361 00:19:30,430 --> 00:19:33,040 are now in sorted order, and what I can do 362 00:19:33,040 --> 00:19:37,240 is I can just walk down this list. 363 00:19:37,240 --> 00:19:39,550 A lot of these cells are empty, potentially. 364 00:19:42,430 --> 00:19:44,890 Some of the keys might not be there, but what 365 00:19:44,890 --> 00:19:47,170 I can do is just walk down this list, 366 00:19:47,170 --> 00:19:51,950 pick off every item that does exist, stick them in an array-- 367 00:19:51,950 --> 00:19:52,450 I'm done. 368 00:19:55,990 --> 00:20:00,400 Stick a key into here and then-- 369 00:20:00,400 --> 00:20:03,260 all right. 370 00:20:03,260 --> 00:20:08,750 Make direct access array. 371 00:20:08,750 --> 00:20:20,630 Store items-- item x in index x.key. 372 00:20:25,700 --> 00:20:43,000 Walk down direct access array, and return items seen in order. 373 00:20:43,000 --> 00:20:45,560 Does that make sense to everybody? 374 00:20:45,560 --> 00:20:47,410 All right, how long does this step take? 
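The three steps written on the board can be sketched as follows. This is an illustrative version, not the lecture's own code: it assumes unique, non-negative integer keys, takes a `key` function in place of `x.key` so plain integers can be sorted directly, and computes u from the largest key rather than taking it as given.

```python
def direct_access_array_sort(items, key=lambda x: x):
    """Sort items with unique non-negative integer keys in O(n + u) time."""
    if not items:
        return []
    # Assumption: u is taken as (largest key + 1); the lecture treats u as known.
    u = 1 + max(key(x) for x in items)
    # Step 1: make the direct access array, O(u).
    slots = [None] * u
    # Step 2: store each item x at index key(x), O(n) total.
    for x in items:
        slots[key(x)] = x
    # Step 3: walk down the array and return the items seen, in order, O(u).
    return [x for x in slots if x is not None]
```

For example, `direct_access_array_sort([5, 2, 7, 0, 4])` walks an array of 8 slots and collects the five occupied ones in increasing key order.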
375 00:20:53,910 --> 00:20:58,380 Building a direct access array order u-- 376 00:20:58,380 --> 00:21:01,970 OK, so this is order u-- 377 00:21:01,970 --> 00:21:02,970 how long does this take? 378 00:21:06,390 --> 00:21:08,160 How many items you have to insert? 379 00:21:08,160 --> 00:21:11,240 Order n, or just n-- 380 00:21:11,240 --> 00:21:14,210 and how long does it take to insert each one of these things 381 00:21:14,210 --> 00:21:17,110 into my direct access array? 382 00:21:17,110 --> 00:21:20,850 Worst case constant time-- 383 00:21:20,850 --> 00:21:26,220 so this is n times worst case constant time-- 384 00:21:26,220 --> 00:21:27,115 great. 385 00:21:27,115 --> 00:21:28,490 How long does this last one take? 386 00:21:35,030 --> 00:21:36,880 Anyone? 387 00:21:36,880 --> 00:21:39,070 O of u also-- right, because I'm walking down 388 00:21:39,070 --> 00:21:40,150 the entire length of u. 389 00:21:43,510 --> 00:21:50,840 So this algorithm takes, in total, n plus u time. 390 00:21:50,840 --> 00:21:53,810 This is great. 391 00:21:53,810 --> 00:21:56,375 u is bigger than n, because we assumed distinct keys. 392 00:21:58,920 --> 00:22:03,150 But if u is on the order of n, then we now 393 00:22:03,150 --> 00:22:04,650 have linear time sorting algorithm. 394 00:22:04,650 --> 00:22:07,360 Yes? 395 00:22:07,360 --> 00:22:08,306 What's up? 396 00:22:08,306 --> 00:22:10,277 AUDIENCE: [INAUDIBLE] 397 00:22:10,277 --> 00:22:11,110 JASON KU: I'm sorry. 398 00:22:11,110 --> 00:22:11,985 You have to speak up. 399 00:22:11,985 --> 00:22:16,670 AUDIENCE: How do you attach keys to the [INAUDIBLE]?? 400 00:22:16,670 --> 00:22:21,760 JASON KU: How do I attach keys to my inputs in my-- 401 00:22:21,760 --> 00:22:25,090 for a set data structure that we've been talking about, 402 00:22:25,090 --> 00:22:27,350 all of my items have keys. 403 00:22:27,350 --> 00:22:30,810 That's just something that we impose on our input. 
404 00:22:30,810 --> 00:22:34,310 AUDIENCE: [INAUDIBLE] 405 00:22:34,310 --> 00:22:36,270 JASON KU: Each of the keys is-- in this case, 406 00:22:36,270 --> 00:22:37,190 it has to be a number. 407 00:22:40,380 --> 00:22:42,150 That's a nice point. 408 00:22:42,150 --> 00:22:48,260 We do this to talk about sorting items generally so 409 00:22:48,260 --> 00:22:51,050 that we don't have to deal with potentially if these keys have 410 00:22:51,050 --> 00:22:53,360 values associated with-- or other stuff associated-- 411 00:22:53,360 --> 00:22:56,100 put them on that item, and they'll still be there. 412 00:22:56,100 --> 00:22:59,090 But in general, if you just wanted to sort integers, 413 00:22:59,090 --> 00:23:02,480 you could say that .key is-- 414 00:23:02,480 --> 00:23:04,545 points back to the object itself, 415 00:23:04,545 --> 00:23:06,170 if you want to just sort some integers. 416 00:23:06,170 --> 00:23:07,220 Does that make sense? 417 00:23:07,220 --> 00:23:09,220 It's a good question, though. 418 00:23:09,220 --> 00:23:12,880 OK, so that gives us a linear time algorithm 419 00:23:12,880 --> 00:23:16,510 when u is small, and under this condition 420 00:23:16,510 --> 00:23:20,050 that I have unique keys when I want to sort. 421 00:23:20,050 --> 00:23:22,840 Those are fairly restrictive, so we 422 00:23:22,840 --> 00:23:24,730 might want to generalize this a little bit. 423 00:23:24,730 --> 00:23:26,160 OK? 424 00:23:26,160 --> 00:23:30,030 So that's direct access array sort. 425 00:23:30,030 --> 00:23:34,410 What if we had a set of keys that was a little larger? 426 00:23:40,480 --> 00:23:53,250 So let's say u is theta n implies linear time sorting. 427 00:23:53,250 --> 00:23:54,810 That's great. 428 00:23:54,810 --> 00:23:59,250 So now, what happens if we expand that range a little bit? 429 00:23:59,250 --> 00:24:02,880 Say u is less than or equal to n squared-- 430 00:24:02,880 --> 00:24:03,960 maybe just less than. 
431 00:24:06,470 --> 00:24:14,300 OK, this is a bigger range. And if we instantiated 432 00:24:14,300 --> 00:24:18,500 a direct access array of quadratic size, 433 00:24:18,500 --> 00:24:20,210 we'd have a quadratic time algorithm. 434 00:24:20,210 --> 00:24:21,170 This is not helpful. 435 00:24:24,000 --> 00:24:30,740 Anyone have a way in which we could sort integers that 436 00:24:30,740 --> 00:24:33,170 are between 0 and n squared? 437 00:24:36,070 --> 00:24:38,140 Maybe using the stuff that we had above-- 438 00:24:42,860 --> 00:24:43,360 Yeah? 439 00:24:43,360 --> 00:24:46,460 AUDIENCE: [INAUDIBLE] sort by the first n, 440 00:24:46,460 --> 00:24:49,790 kind of like the first digit. 441 00:24:49,790 --> 00:24:52,520 JASON KU: Your colleague is saying exactly 442 00:24:52,520 --> 00:24:54,980 the thing that I'm looking for, which is great, 443 00:24:54,980 --> 00:25:02,970 which is maybe we could break this larger number into two 444 00:25:02,970 --> 00:25:05,700 smaller numbers. 445 00:25:05,700 --> 00:25:12,810 Any integer that is between 0 and n squared can be written as-- 446 00:25:15,330 --> 00:25:26,580 key can be some a and b, where a is essentially the higher n 447 00:25:26,580 --> 00:25:28,980 and b is the lower n. 448 00:25:28,980 --> 00:25:31,020 This is kind of weird. 449 00:25:31,020 --> 00:25:33,340 OK, so what do I actually mean by this? 450 00:25:33,340 --> 00:25:43,350 I mean that let's let a be K, when I divide it by n-- 451 00:25:43,350 --> 00:25:51,300 integer, the floor-- key integer-divided by n. 452 00:25:51,300 --> 00:25:57,000 And b equals K mod n. 453 00:25:57,000 --> 00:26:00,960 So this is a number that's less than n and this is 454 00:26:00,960 --> 00:26:03,120 a number that's less than n. 455 00:26:03,120 --> 00:26:04,480 Does that make sense? 456 00:26:04,480 --> 00:26:08,430 And actually, I can recover K at any time 457 00:26:08,430 --> 00:26:13,130 by saying K equals an plus b. 
458 00:26:13,130 --> 00:26:17,710 I've essentially decomposed this into a base n 459 00:26:17,710 --> 00:26:21,350 representation of this number. 460 00:26:21,350 --> 00:26:23,690 And I have two digits in that number. 461 00:26:23,690 --> 00:26:26,510 This is the n-th-- 462 00:26:26,510 --> 00:26:29,560 n digit, and this is the ones digit. 463 00:26:29,560 --> 00:26:31,460 Does that make sense? 464 00:26:31,460 --> 00:26:36,035 All right, so now let's say I have this list of numbers-- 465 00:26:43,980 --> 00:26:49,390 17, 3, 24, 22, 12. 466 00:26:54,660 --> 00:26:57,610 Here I have five numbers. 467 00:26:57,610 --> 00:27:00,290 So what's n in this case? 468 00:27:00,290 --> 00:27:03,320 5-- OK, not so interesting. 469 00:27:03,320 --> 00:27:06,320 n is 5 here. 470 00:27:06,320 --> 00:27:11,870 And I'm going to represent this as five pairs of numbers 471 00:27:11,870 --> 00:27:15,930 that are each within the bounds of 0 to 4. 472 00:27:15,930 --> 00:27:17,310 Does that make sense? 473 00:27:17,310 --> 00:27:20,525 So what is my a, b representation of 17? 474 00:27:24,320 --> 00:27:30,380 3, 2-- OK. 475 00:27:30,380 --> 00:27:35,400 Yeah, so that's 3 times 5 plus 2. 476 00:27:35,400 --> 00:27:35,900 That's good. 477 00:27:35,900 --> 00:27:37,040 That's 17. 478 00:27:37,040 --> 00:27:38,240 Yeah? 479 00:27:38,240 --> 00:27:41,600 I think your colleague did that, right? 480 00:27:41,600 --> 00:27:43,220 I have all of these written down, 481 00:27:43,220 --> 00:27:44,637 so I'm just going to write it out. 482 00:27:53,810 --> 00:27:56,690 And I hope I did it correctly. 483 00:27:56,690 --> 00:28:00,890 OK-- 3, 2; 0, 3; 4, 4; 4, 2; 2, 2-- 484 00:28:00,890 --> 00:28:02,230 OK. 485 00:28:02,230 --> 00:28:04,210 So now I have a bunch of things that I 486 00:28:04,210 --> 00:28:12,480 want to sort based on this function that I have. 487 00:28:12,480 --> 00:28:15,390 These are no longer just integers that I need to sort.
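The base-n split worked out on the board can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the helper name `split_key` is made up here, and n is assumed to be the number of keys, as in the example.

```python
def split_key(k, n):
    """Return (a, b) with k == a * n + b and 0 <= b < n."""
    a = k // n   # high ("n's") digit: integer division
    b = k % n    # low ("1's") digit: remainder
    return a, b

keys = [17, 3, 24, 22, 12]
n = len(keys)                      # n = 5 in this example
pairs = [split_key(k, n) for k in keys]
print(pairs)  # [(3, 2), (0, 3), (4, 4), (4, 2), (2, 2)]

# The original key can always be recovered as a * n + b.
assert all(a * n + b == k for k, (a, b) in zip(keys, pairs))
```

Both operations are constant time per key, so building all n pairs takes linear time.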
488 00:28:15,390 --> 00:28:19,470 I need to sort by this transformation of this thing 489 00:28:19,470 --> 00:28:20,580 into a number. 490 00:28:20,580 --> 00:28:22,400 Does that make sense? 491 00:28:22,400 --> 00:28:27,490 So anyone have any ideas on how we could-- 492 00:28:27,490 --> 00:28:31,810 by the way, these are both constant time operations 493 00:28:31,810 --> 00:28:36,370 on your computer, as long as it's an integer division 494 00:28:36,370 --> 00:28:38,080 and this is mod. 495 00:28:38,080 --> 00:28:41,260 Python also has a nice thing, I think, 496 00:28:41,260 --> 00:28:51,100 in its standard operations, which is divmod of K, n. 497 00:28:51,100 --> 00:28:52,250 Is that right? 498 00:28:52,250 --> 00:28:53,860 Yeah. 499 00:28:53,860 --> 00:28:55,360 So if you want to use that, you can. 500 00:28:58,180 --> 00:29:00,850 OK, so how do we sort these tuples? 501 00:29:00,850 --> 00:29:02,800 These are tuples, right? 502 00:29:02,800 --> 00:29:06,250 You guys are, I'm sure, very familiar with tuples by now. 503 00:29:08,860 --> 00:29:09,985 How do I sort these tuples? 504 00:29:14,680 --> 00:29:18,220 What's the most important digit of this thing? 505 00:29:18,220 --> 00:29:21,120 If I had to sort one of the digits 506 00:29:21,120 --> 00:29:24,250 and get something that's close to sorted, 507 00:29:24,250 --> 00:29:28,800 what's more important-- the 1's digit or the n's digit? 508 00:29:28,800 --> 00:29:31,170 OK, we have discrepancy here. 509 00:29:31,170 --> 00:29:33,700 Who says 1? 510 00:29:33,700 --> 00:29:35,380 Who says n? 511 00:29:35,380 --> 00:29:37,060 Someone who said n tell me why. 512 00:29:41,620 --> 00:29:45,742 Oh, you all think that way for no reason. 513 00:29:45,742 --> 00:29:47,560 AUDIENCE: [INAUDIBLE] 514 00:29:47,560 --> 00:29:48,550 JASON KU: Yeah. 515 00:29:48,550 --> 00:29:49,180 Sorry. 516 00:29:49,180 --> 00:29:50,600 This is a little confusing. 517 00:29:50,600 --> 00:29:51,580 This is the 1's digit. 
518 00:29:51,580 --> 00:29:53,110 This is the n's digit. 519 00:29:53,110 --> 00:29:54,250 This is the n's digit. 520 00:29:54,250 --> 00:29:56,770 This is the 1's digit in how I'm writing this. 521 00:29:56,770 --> 00:29:58,710 Does that make sense? 522 00:29:58,710 --> 00:29:59,210 Yeah? 523 00:29:59,210 --> 00:30:03,701 AUDIENCE: [INAUDIBLE] have a different ones digit 524 00:30:03,701 --> 00:30:04,700 inside of it. 525 00:30:04,700 --> 00:30:07,856 So you could have [INAUDIBLE] but 526 00:30:07,856 --> 00:30:10,211 that only tells you where they are with regard 527 00:30:10,211 --> 00:30:11,794 to the specific n category they're in. 528 00:30:11,794 --> 00:30:13,100 So it's more of a [INAUDIBLE]. 529 00:30:13,100 --> 00:30:13,725 JASON KU: Yeah. 530 00:30:13,725 --> 00:30:18,010 So what your colleague is saying is exactly correct. 531 00:30:18,010 --> 00:30:23,760 I could vary b all I want, right, with the same a. 532 00:30:23,760 --> 00:30:26,730 If I change a by 1, it doesn't matter what 533 00:30:26,730 --> 00:30:28,110 b is-- it's going to be bigger. 534 00:30:31,840 --> 00:30:33,910 Does that make sense? 535 00:30:33,910 --> 00:30:36,460 The K is much more sensitive to a 536 00:30:36,460 --> 00:30:40,040 than it is to b, so a is more important than b. 537 00:30:40,040 --> 00:30:42,080 Does that make sense? 538 00:30:42,080 --> 00:30:47,560 So if I just wanted to get some linear time algorithm, 539 00:30:47,560 --> 00:30:50,230 I could just sort by their bigger digits 540 00:30:50,230 --> 00:30:55,630 and hope they don't differ very much on the smaller things. 541 00:30:55,630 --> 00:30:57,800 I've kind of sorted these things. 542 00:30:57,800 --> 00:30:59,480 Does that make sense? 543 00:30:59,480 --> 00:30:59,980 OK. 544 00:30:59,980 --> 00:31:01,900 What if I actually want to sort these things? 545 00:31:06,250 --> 00:31:06,970 Any hints? 546 00:31:10,150 --> 00:31:12,590 Yeah? 547 00:31:12,590 --> 00:31:14,330 I need to sort on both, in some sense.
548 00:31:17,630 --> 00:31:19,130 What I'm going to tell you right now 549 00:31:19,130 --> 00:31:23,690 is an algorithm that I like to call tuple sort, 550 00:31:23,690 --> 00:31:28,280 but you can also think of it as Excel spreadsheet sort. 551 00:31:28,280 --> 00:31:31,220 I have an Excel spreadsheet of a bunch of data. 552 00:31:31,220 --> 00:31:34,665 I have a prioritization on how important the keys are to me-- 553 00:31:34,665 --> 00:31:35,165 the columns. 554 00:31:37,720 --> 00:31:42,730 And if I have a very important column and an order 555 00:31:42,730 --> 00:31:46,100 of the columns of how important they are to me, 556 00:31:46,100 --> 00:31:49,730 I can repeatedly sort on the columns 557 00:31:49,730 --> 00:31:54,010 until they're sorted based on my preference. 558 00:31:54,010 --> 00:31:57,080 That's something that you may have done. 559 00:31:57,080 --> 00:32:00,350 Now, if I have an ordering on the preferences of my columns, 560 00:32:00,350 --> 00:32:04,430 do I start by sorting all of them 561 00:32:04,430 --> 00:32:07,980 on the most important thing or the least important thing? 562 00:32:07,980 --> 00:32:10,320 What? 563 00:32:10,320 --> 00:32:12,590 Who says most? 564 00:32:12,590 --> 00:32:13,550 Who says least? 565 00:32:16,090 --> 00:32:18,970 There's discrepancy here. 566 00:32:18,970 --> 00:32:22,520 All right, let's try it out. 567 00:32:22,520 --> 00:32:26,060 All right, tuple sort-- 568 00:32:26,060 --> 00:32:30,950 let's start by sorting these things by least 569 00:32:30,950 --> 00:32:33,650 significant first, and then-- 570 00:32:33,650 --> 00:32:36,140 no, most significant first and then least significant. 571 00:32:36,140 --> 00:32:38,570 That was the first thing I asked you, right? 572 00:32:38,570 --> 00:32:41,940 All right, so these are the most significant things, 573 00:32:41,940 --> 00:32:42,620 the first ones. 574 00:32:42,620 --> 00:32:46,050 And these are the less significant things.
575 00:32:46,050 --> 00:32:48,930 All right, instead of writing it as tuples, 576 00:32:48,930 --> 00:32:55,110 I'm going to write them as 32, 03, 44, 42, 22. 577 00:32:55,110 --> 00:32:56,340 Is everyone cool with that? 578 00:32:56,340 --> 00:33:00,440 This is just base five representation. 579 00:33:00,440 --> 00:33:04,610 All right, so let's start by sorting all of these things 580 00:33:04,610 --> 00:33:08,180 by the most significant thing, which 581 00:33:08,180 --> 00:33:12,620 is by this guy, this guy, this guy, this guy, and this guy. 582 00:33:12,620 --> 00:33:13,470 So how do I do it? 583 00:33:13,470 --> 00:33:22,530 The first one is 03, second one is 22, the next one is 32, 42, 584 00:33:22,530 --> 00:33:23,295 and then 44-- 585 00:33:26,000 --> 00:33:27,800 maybe 44? 586 00:33:27,800 --> 00:33:29,090 I don't know. 587 00:33:29,090 --> 00:33:33,208 Does it matter, the order in which I put these things? 588 00:33:33,208 --> 00:33:33,750 I don't know. 589 00:33:33,750 --> 00:33:36,090 I'm just going to keep it the same order for now. 590 00:33:36,090 --> 00:33:38,670 All right, so I've sorted it by the least significant-- 591 00:33:38,670 --> 00:33:41,100 or the most significant-- sorry-- 592 00:33:41,100 --> 00:33:43,350 the leading term. 593 00:33:43,350 --> 00:33:46,720 And now I'm going to sort by the least significant. 594 00:33:46,720 --> 00:33:49,620 So what's the least significant here? 595 00:33:49,620 --> 00:33:53,950 22-- then 2 is also-- 596 00:33:53,950 --> 00:33:55,180 this is also 2. 597 00:33:55,180 --> 00:33:57,100 This is also 2. 598 00:33:57,100 --> 00:33:59,440 This is 3. 599 00:33:59,440 --> 00:34:01,630 And sorted list-- voila. 600 00:34:04,640 --> 00:34:05,690 Why did that not work? 601 00:34:09,100 --> 00:34:10,471 Yeah? 602 00:34:10,471 --> 00:34:17,034 AUDIENCE: [INAUDIBLE] 603 00:34:17,034 --> 00:34:17,659 JASON KU: Yeah.
604 00:34:17,659 --> 00:34:20,480 So what happened is I did take into account 605 00:34:20,480 --> 00:34:23,179 the significant digit sort, but when 606 00:34:23,179 --> 00:34:27,620 I did the less significant thing, it erased all of my work 607 00:34:27,620 --> 00:34:28,739 from up here. 608 00:34:28,739 --> 00:34:31,080 Does that make sense? 609 00:34:31,080 --> 00:34:36,780 In the case of ties, we want the more significant thing 610 00:34:36,780 --> 00:34:39,675 to take precedence, so we want to do that thing last. 611 00:34:39,675 --> 00:34:41,230 Does that make sense? 612 00:34:41,230 --> 00:34:44,560 So the right way to do this-- 613 00:34:44,560 --> 00:34:54,415 this is the most significant first [INAUDIBLE] not good. 614 00:34:54,415 --> 00:34:56,040 All right, least significant first-- 615 00:34:56,040 --> 00:34:59,010 let's try that. 616 00:34:59,010 --> 00:35:02,370 So least significant here is 2. 617 00:35:02,370 --> 00:35:16,520 OK, so I see 32, 42, 22, 03, and then 44. 618 00:35:16,520 --> 00:35:18,740 OK? 619 00:35:18,740 --> 00:35:20,160 Sound good? 620 00:35:20,160 --> 00:35:24,013 Least significant first-- now I do most significant. 621 00:35:24,013 --> 00:35:25,430 I sort the most significant thing. 622 00:35:25,430 --> 00:35:27,590 OK, so what's the most significant thing? 623 00:35:27,590 --> 00:35:37,610 03, 22, 32-- most significant four-- 624 00:35:37,610 --> 00:35:41,310 44, and 42-- cool. 625 00:35:41,310 --> 00:35:44,260 We're sorted, right? 626 00:35:44,260 --> 00:35:45,790 I did what you told me to do. 627 00:35:45,790 --> 00:35:50,750 I sorted by the most significant thing. 628 00:35:50,750 --> 00:35:51,950 What's the problem here? 629 00:35:58,920 --> 00:36:00,300 What did I do wrong? 630 00:36:00,300 --> 00:36:05,830 You wanted me to put 42 here and 44 here, right? 631 00:36:05,830 --> 00:36:10,120 Because 42 came first in the input and 44 632 00:36:10,120 --> 00:36:11,010 came second, right?
633 00:36:14,070 --> 00:36:18,510 OK, if a sorting algorithm maintains this property that, 634 00:36:18,510 --> 00:36:25,050 if they are the same thing, then the output 635 00:36:25,050 --> 00:36:29,400 maintains their order from the input to the output-- 636 00:36:29,400 --> 00:36:33,260 their relative order-- that's what we call a stable sorting 637 00:36:33,260 --> 00:36:34,190 algorithm. 638 00:36:34,190 --> 00:36:36,650 And so if we have a stable sorting algorithm when 639 00:36:36,650 --> 00:36:40,850 we're doing tuple sort, when we're sorting on different keys 640 00:36:40,850 --> 00:36:45,110 or columns of a set, we really want 641 00:36:45,110 --> 00:36:48,110 to be using a stable sorting algorithm. 642 00:36:48,110 --> 00:36:50,250 Does that make sense? 643 00:36:50,250 --> 00:36:52,850 Because otherwise, we may mess up 644 00:36:52,850 --> 00:36:56,330 work we did before in a previous sort 645 00:36:56,330 --> 00:36:58,500 of the less significant things. 646 00:36:58,500 --> 00:37:04,610 And so yes, we want a stable sorting algorithm here, 647 00:37:04,610 --> 00:37:08,690 because then we will end up sorting our thing. 648 00:37:08,690 --> 00:37:10,190 Does that make sense? 649 00:37:10,190 --> 00:37:11,058 Yes? 650 00:37:11,058 --> 00:37:17,760 AUDIENCE: [INAUDIBLE] 651 00:37:17,760 --> 00:37:21,810 JASON KU: So what your colleague is saying-- 652 00:37:21,810 --> 00:37:26,220 let's sort by most significant, then look at all of the things 653 00:37:26,220 --> 00:37:32,530 with one of those that are the same, and now sort that. 654 00:37:32,530 --> 00:37:34,750 That's something we could do. 655 00:37:34,750 --> 00:37:36,140 How long would that take? 656 00:37:36,140 --> 00:37:43,220 Well, let's say I didn't use half of my more significant set 657 00:37:43,220 --> 00:37:44,810 of digits. 658 00:37:44,810 --> 00:37:49,960 Say I'm only using n/2 or-- 659 00:37:49,960 --> 00:37:51,750 that's not quite going to get what I want.
660 00:37:54,672 --> 00:37:58,517 AUDIENCE: [INAUDIBLE] 661 00:37:58,517 --> 00:37:59,350 JASON KU: Say again. 662 00:37:59,350 --> 00:38:01,875 AUDIENCE: We'll take n squared [INAUDIBLE].. 663 00:38:01,875 --> 00:38:02,500 JASON KU: Yeah. 664 00:38:02,500 --> 00:38:06,760 So what we're going to do, if we have direct access array sort-- 665 00:38:06,760 --> 00:38:08,920 if I then go into each one of these digits 666 00:38:08,920 --> 00:38:11,410 and try to sort the things that are in there, 667 00:38:11,410 --> 00:38:13,210 that's going to take time. 668 00:38:13,210 --> 00:38:16,810 It's going to take time for each of those digits. 669 00:38:16,810 --> 00:38:20,440 There might be a ton of collisions 670 00:38:20,440 --> 00:38:25,030 into one of the things, and so I might take more time 671 00:38:25,030 --> 00:38:26,860 to sort that than linear. 672 00:38:26,860 --> 00:38:28,250 Does that make sense? 673 00:38:28,250 --> 00:38:31,840 So I would prefer to do this tuple sort kind of behavior, 674 00:38:31,840 --> 00:38:34,990 sorting the smaller thing, sorting the bigger thing. 675 00:38:34,990 --> 00:38:37,390 And because I only have a constant number of things 676 00:38:37,390 --> 00:38:40,540 in my tuples, this is important, because I only 677 00:38:40,540 --> 00:38:43,150 have two things I'm worried about here. 678 00:38:43,150 --> 00:38:48,840 I only have to do two passes of a sorting algorithm 679 00:38:48,840 --> 00:38:51,180 to be able to sort these numbers. 680 00:38:51,180 --> 00:38:56,180 However, can I use direct access array sort here? 681 00:38:56,180 --> 00:39:00,320 What was the initial stipulation I had on direct access array? 682 00:39:00,320 --> 00:39:03,048 That the keys were unique-- 683 00:39:03,048 --> 00:39:05,090 that's exactly the opposite of what we have here. 684 00:39:05,090 --> 00:39:06,890 We have things that could be the same. 685 00:39:10,970 --> 00:39:12,320 So we give up-- 686 00:39:12,320 --> 00:39:12,850 can't do it. 
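The two-pass tuple sort idea (least significant digit first, then most significant, each pass stable) can be sketched with Python's built-in `sorted`, which is documented to be stable. It is used here purely as a stand-in for illustration; the lecture goes on to build its own stable sort. The name `tuple_sort_pairs` is hypothetical.

```python
def tuple_sort_pairs(pairs):
    """Sort (a, b) pairs by a, then b, using two stable passes."""
    # Pass 1: least significant digit first.
    by_low = sorted(pairs, key=lambda p: p[1])
    # Pass 2: most significant digit; stability keeps ties in pass-1 order.
    return sorted(by_low, key=lambda p: p[0])

pairs = [(3, 2), (0, 3), (4, 4), (4, 2), (2, 2)]
print(tuple_sort_pairs(pairs))
# [(0, 3), (2, 2), (3, 2), (4, 2), (4, 4)]
```

Swapping the order of the two passes reproduces the failure seen on the board: the second pass would undo the ordering established by the more significant digit.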
687 00:39:16,093 --> 00:39:17,010 What do we do instead? 688 00:39:20,790 --> 00:39:21,290 Yeah? 689 00:39:21,290 --> 00:39:29,030 AUDIENCE: [INAUDIBLE] 690 00:39:29,030 --> 00:39:31,590 JASON KU: You've already said the thing that I'm looking for, 691 00:39:31,590 --> 00:39:32,750 so that's great. 692 00:39:32,750 --> 00:39:35,180 Your colleague said, why can't we 693 00:39:35,180 --> 00:39:39,140 just put more things at a key? 694 00:39:39,140 --> 00:39:41,530 Why can't we put a list there? 695 00:39:41,530 --> 00:39:42,662 That's exactly what we do. 696 00:39:42,662 --> 00:39:43,870 This is called counting sort. 697 00:39:48,340 --> 00:39:50,980 And what we do here is we still have this direct access 698 00:39:50,980 --> 00:39:56,140 array of size u, indexed 0 to u minus 1, 699 00:39:56,140 --> 00:40:03,390 but instead of storing one thing here at each key K, 700 00:40:03,390 --> 00:40:07,830 we store a pointer to a chain. 701 00:40:07,830 --> 00:40:10,350 This sounds like hashing, right? 702 00:40:10,350 --> 00:40:13,230 But the important thing is that I 703 00:40:13,230 --> 00:40:16,290 need to make sure, as I'm inserting things 704 00:40:16,290 --> 00:40:19,170 in here, that I'm maintaining the order in which they 705 00:40:19,170 --> 00:40:20,330 came in. 706 00:40:20,330 --> 00:40:21,930 I can't just throw them willy-nilly, 707 00:40:21,930 --> 00:40:25,020 or else we have this problem up here that we had before. 708 00:40:25,020 --> 00:40:28,350 So I need what I would say is a sequence data 709 00:40:28,350 --> 00:40:32,310 structure, something that will maintain the order that I-- 710 00:40:32,310 --> 00:40:34,140 the extrinsic order that I had when 711 00:40:34,140 --> 00:40:37,360 I'm putting these things in. 712 00:40:37,360 --> 00:40:43,920 So as I have multiple things with K, 713 00:40:43,920 --> 00:40:46,770 I'm going to put them in the order.
714 00:40:46,770 --> 00:40:50,160 I can put-- have a pointer to a dynamic array or a linked list, 715 00:40:50,160 --> 00:40:53,418 where I just add things to the end. 716 00:40:53,418 --> 00:40:54,960 And then, at the end of my algorithm, 717 00:40:54,960 --> 00:40:58,890 when I read off the things, I can just 718 00:40:58,890 --> 00:41:02,460 look at any one that has a non-empty data 719 00:41:02,460 --> 00:41:04,980 structure under here and read them off in the order 720 00:41:04,980 --> 00:41:06,770 that they came. 721 00:41:06,770 --> 00:41:09,500 Does that make sense? 722 00:41:09,500 --> 00:41:14,200 So for this example, I'm just going 723 00:41:14,200 --> 00:41:20,230 to do this last step here from the first row 724 00:41:20,230 --> 00:41:22,280 to the second row. 725 00:41:22,280 --> 00:41:28,030 I'm going to have this direct access array with 0, 1, 2, 3, 4 726 00:41:28,030 --> 00:41:28,705 on the slots. 727 00:41:32,110 --> 00:41:35,930 So how am I going to do this counting sort now? 728 00:41:35,930 --> 00:41:41,540 I have 32, 42, 22, 03, and 44. 729 00:41:41,540 --> 00:41:43,460 I can take the first one, 32. 730 00:41:43,460 --> 00:41:46,100 I'm sorting by the most significant thing. 731 00:41:46,100 --> 00:41:48,890 I stick it here-- 732 00:41:48,890 --> 00:41:53,484 32, and then 44-- 733 00:41:53,484 --> 00:41:59,360 42-- sorry-- 42, 22. 734 00:41:59,360 --> 00:42:06,800 This is not so much different yet than dynamic array-- 735 00:42:06,800 --> 00:42:08,120 direct access array sort. 736 00:42:08,120 --> 00:42:14,700 But when we get to this duplicate, 737 00:42:14,700 --> 00:42:19,430 44 here, we now have two things in this thing. 738 00:42:19,430 --> 00:42:24,230 And because we are keeping them in order in this sequence, 739 00:42:24,230 --> 00:42:26,060 I'm appending to the end.
740 00:42:26,060 --> 00:42:31,700 Then, when I go and read off the different things, 741 00:42:31,700 --> 00:42:35,000 then I'm returning them in a stable way in the way 742 00:42:35,000 --> 00:42:36,290 that I want them to be. 743 00:42:36,290 --> 00:42:38,620 Does that make sense? 744 00:42:38,620 --> 00:42:40,380 And it's not overwriting the work 745 00:42:40,380 --> 00:42:42,300 I did on the lower significant digits. 746 00:42:44,900 --> 00:42:46,160 So how long does this take? 747 00:42:49,660 --> 00:42:56,290 This also only takes order n plus u, 748 00:42:56,290 --> 00:43:00,450 because I'm instantiating this thing of size u. 749 00:43:00,450 --> 00:43:02,730 And then, how big are these data structures? 750 00:43:02,730 --> 00:43:07,630 Well, maybe I'm storing one, a constant amount for each index. 751 00:43:07,630 --> 00:43:09,720 So that's a u overhead. 752 00:43:09,720 --> 00:43:14,220 And then I'm paying 1 for every item I'm storing. 753 00:43:14,220 --> 00:43:18,030 These things are only the lengths. 754 00:43:18,030 --> 00:43:21,360 The sum total of their lengths is n, 755 00:43:21,360 --> 00:43:24,990 because I'm only storing n things in there. 756 00:43:24,990 --> 00:43:29,870 So the total amount of space, the total amount 757 00:43:29,870 --> 00:43:32,480 of work I have to do is order-- 758 00:43:32,480 --> 00:43:36,200 I need to be able to append in constant time 759 00:43:36,200 --> 00:43:38,390 and I need to be able to cycle through these things, 760 00:43:38,390 --> 00:43:40,550 iterate over them in linear time. 761 00:43:40,550 --> 00:43:42,830 But if I have that, I get n plus u. 762 00:43:42,830 --> 00:43:44,110 Yeah? 763 00:43:44,110 --> 00:43:45,800 AUDIENCE: How do you ensure that, 764 00:43:45,800 --> 00:43:49,167 within your linked list or your dynamic-- those elements, 765 00:43:49,167 --> 00:43:52,534 like four equals four-- how do you make sure 766 00:43:52,534 --> 00:43:55,460 that those are sorted?
767 00:43:55,460 --> 00:43:58,790 JASON KU: So your colleague is saying, 768 00:43:58,790 --> 00:44:01,280 how do I ensure that the things in these lists, 769 00:44:01,280 --> 00:44:03,920 where they collide, how do you ensure that they're sorted? 770 00:44:03,920 --> 00:44:05,270 I don't. 771 00:44:05,270 --> 00:44:09,740 I just ensure that they came in the order that they came. 772 00:44:09,740 --> 00:44:14,700 But as long as I sorted the lower order digits correctly 773 00:44:14,700 --> 00:44:19,140 in the previous things, then I'm assuming 774 00:44:19,140 --> 00:44:20,880 that their order as they come in will 775 00:44:20,880 --> 00:44:22,470 be sorted, if they collide. 776 00:44:22,470 --> 00:44:23,910 That's the assumption. 777 00:44:23,910 --> 00:44:27,997 That's the reason why I'm doing these building up 778 00:44:27,997 --> 00:44:29,580 from the least significant to the most 779 00:44:29,580 --> 00:44:33,690 significant is so that I know that, when they collide, 780 00:44:33,690 --> 00:44:37,675 the underlying stuff there is sorted already in the input. 781 00:44:37,675 --> 00:44:38,550 Does that make sense? 782 00:44:38,550 --> 00:44:39,422 Great-- yeah? 783 00:44:39,422 --> 00:44:45,246 AUDIENCE: So this array isn't as big as u. 784 00:44:45,246 --> 00:44:48,388 It's as big as n. 785 00:44:48,388 --> 00:44:50,680 JASON KU: I'm using a direct access array on the keys-- 786 00:44:50,680 --> 00:44:51,460 oh, this is n. 787 00:44:56,080 --> 00:44:59,140 So counting sort is general for any u. 788 00:44:59,140 --> 00:45:04,360 I just happened to pick u being n in this case when I broke 789 00:45:04,360 --> 00:45:06,010 this thing up into n squared. 790 00:45:06,010 --> 00:45:07,870 But this general concept is-- 791 00:45:10,440 --> 00:45:12,160 doesn't matter what I choose for u. 792 00:45:12,160 --> 00:45:15,100 Does that make sense? 793 00:45:15,100 --> 00:45:15,600 OK. 
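The chained version of counting sort described above can be sketched as follows. This is a minimal illustration under some assumptions, not the lecture's own code: the function name and its `key` parameter are made up here, and Python lists stand in for the dynamic arrays that hold each chain.

```python
def counting_sort(items, u, key=lambda x: x):
    """Stable sort of items whose key(x) is an integer in range(u)."""
    chains = [[] for _ in range(u)]      # one chain per possible key: O(u)
    for x in items:                      # append keeps arrival order: O(n)
        chains[key(x)].append(x)
    out = []
    for chain in chains:                 # read chains off in key order: O(n + u)
        out.extend(chain)
    return out

# Second pass from the board: already sorted by low digit, now sort by high digit.
pairs = [(3, 2), (4, 2), (2, 2), (0, 3), (4, 4)]
print(counting_sort(pairs, 5, key=lambda p: p[0]))
# [(0, 3), (2, 2), (3, 2), (4, 2), (4, 4)]
```

Because each chain is appended to in arrival order, (4, 2) stays ahead of (4, 4), which is exactly the stability the second pass relies on.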
794 00:45:15,600 --> 00:45:18,870 But we will use that right now to sort 795 00:45:18,870 --> 00:45:20,150 larger ranges of numbers. 796 00:45:25,830 --> 00:45:27,120 This was exactly the idea. 797 00:45:27,120 --> 00:45:30,150 We're going to combine tuple sort, use counting 798 00:45:30,150 --> 00:45:32,910 sort as its auxiliary sorting-- 799 00:45:32,910 --> 00:45:38,040 stable sorting algorithm to do all its work on these digits. 800 00:45:38,040 --> 00:45:45,090 And so to sort n squared size numbers, 801 00:45:45,090 --> 00:45:47,340 I get linear time, which is great, 802 00:45:47,340 --> 00:45:49,620 because u is n in this case. 803 00:45:54,890 --> 00:45:56,120 But can I extend that? 804 00:45:56,120 --> 00:45:59,210 What if I had n cubed? 805 00:45:59,210 --> 00:46:04,700 What if I had up to size u equals n cubed, or less than n 806 00:46:04,700 --> 00:46:05,882 cubed? 807 00:46:05,882 --> 00:46:07,340 How many digits would I have there? 808 00:46:12,290 --> 00:46:16,510 How many size n digits would I need to represent 809 00:46:16,510 --> 00:46:18,400 a number of size n cubed? 810 00:46:21,710 --> 00:46:22,400 Any ideas? 811 00:46:24,920 --> 00:46:26,270 What did we do here? 812 00:46:26,270 --> 00:46:28,580 We divided off an n. 813 00:46:28,580 --> 00:46:30,050 We took it and stored it. 814 00:46:30,050 --> 00:46:32,750 We're left with something of size n. 815 00:46:32,750 --> 00:46:36,680 If I had a number of size n cubed, I could divide off an n. 816 00:46:36,680 --> 00:46:38,540 I'm left with something of n squared. 817 00:46:38,540 --> 00:46:40,910 I don't know how to deal with something of n squared. 818 00:46:40,910 --> 00:46:42,530 Actually, I do. 819 00:46:42,530 --> 00:46:46,130 I can split it up into two size n numbers. 820 00:46:46,130 --> 00:46:52,210 So if I had numbers bound-- upper bounded by a cubic-- 821 00:46:52,210 --> 00:46:56,080 n cubed-- I could split it up into three digits.
822 00:46:56,080 --> 00:46:57,850 Three is still constant. 823 00:46:57,850 --> 00:47:01,030 And so I could split it up into three digits, 824 00:47:01,030 --> 00:47:06,310 tuple sort them in their increasing priority, 825 00:47:06,310 --> 00:47:07,300 and sort those. 826 00:47:07,300 --> 00:47:10,660 Again, I'm doing linear work per digit. 827 00:47:10,660 --> 00:47:12,200 I have a constant number of digits, 828 00:47:12,200 --> 00:47:14,500 so I get a linear time algorithm. 829 00:47:14,500 --> 00:47:15,293 Yeah? 830 00:47:15,293 --> 00:47:18,040 AUDIENCE: When it comes to sorting [INAUDIBLE]-- 831 00:47:18,040 --> 00:47:18,960 JASON KU: Uh-huh. 832 00:47:18,960 --> 00:47:21,755 AUDIENCE: Are you ensuring that that runtime is also 833 00:47:21,755 --> 00:47:23,705 big O of n plus u? 834 00:47:23,705 --> 00:47:24,330 JASON KU: Yeah. 835 00:47:24,330 --> 00:47:28,080 So it's always going to be big O of n plus u, 836 00:47:28,080 --> 00:47:33,300 but because I'm bounding my digit size to be n, 837 00:47:33,300 --> 00:47:36,100 u is n there, and so I'm getting linear time. 838 00:47:36,100 --> 00:47:37,180 Does that make sense? 839 00:47:37,180 --> 00:47:37,680 Yeah. 840 00:47:37,680 --> 00:47:39,580 So the idea here-- 841 00:47:39,580 --> 00:47:42,090 this is what we call radix sort. 842 00:47:42,090 --> 00:48:01,180 Radix sort-- break up integers, max size u, 843 00:48:01,180 --> 00:48:10,645 into a base-n tuple. 844 00:48:14,470 --> 00:48:17,650 So basically, each one of my digits can range from 0 to n. 845 00:48:21,790 --> 00:48:24,760 How many base n digits do I have if I have a number of size u? 846 00:48:28,590 --> 00:48:30,690 Yeah, log n of u-- 847 00:48:30,690 --> 00:48:36,900 number of digits is log n of u-- 848 00:48:36,900 --> 00:48:38,130 log base n of u.
849 00:48:41,930 --> 00:48:56,060 And then tuple sort on digits using counting sort, 850 00:48:56,060 --> 00:49:05,190 from least to most significant-- 851 00:49:05,190 --> 00:49:06,780 that's the algorithm. 852 00:49:06,780 --> 00:49:10,200 How long does that take? 853 00:49:10,200 --> 00:49:13,980 How long does it take to sort on a digit that 854 00:49:13,980 --> 00:49:17,550 spans the key 0 to n? 855 00:49:17,550 --> 00:49:18,690 Linear time, right? 856 00:49:18,690 --> 00:49:23,790 Order n time-- how many times do I have to do this tuple sort? 857 00:49:23,790 --> 00:49:27,060 The number of digits times, right? 858 00:49:27,060 --> 00:49:29,370 So the running time of this algorithm-- 859 00:49:29,370 --> 00:49:35,230 first, I have to do this stuff, break up each of the integers. 860 00:49:35,230 --> 00:49:37,420 That takes n time-- 861 00:49:37,420 --> 00:49:38,680 n times the number of digits. 862 00:49:38,680 --> 00:49:42,100 I had to create each one of these tuples-- 863 00:49:42,100 --> 00:49:47,850 so n plus n times the number of digits-- 864 00:49:47,850 --> 00:49:54,900 log base n of u. 865 00:49:54,900 --> 00:49:58,320 So here I had to loop through all the things. 866 00:49:58,320 --> 00:50:01,170 And then here, for each thing, I broke it up 867 00:50:01,170 --> 00:50:09,500 into log base n of u digits, and that's 868 00:50:09,500 --> 00:50:11,480 how long the first thing took. 869 00:50:11,480 --> 00:50:15,250 And then, how long did it take me to tuple sort? 870 00:50:15,250 --> 00:50:17,590 n time per digit-- 871 00:50:17,590 --> 00:50:19,030 so I also get this factor. 872 00:50:19,030 --> 00:50:20,220 Does that make sense? 873 00:50:23,790 --> 00:50:24,630 How long is that? 874 00:50:24,630 --> 00:50:25,240 Is that good? 875 00:50:25,240 --> 00:50:27,010 Is that bad? 876 00:50:27,010 --> 00:50:30,430 For what values of u is this linear time? 
877 00:50:35,500 --> 00:50:43,390 If u is less than n to the c for some constant c, 878 00:50:43,390 --> 00:50:48,430 then the c comes out of the logarithm, log n of n is 1, 879 00:50:48,430 --> 00:50:50,155 and we get a linear time algorithm. 880 00:50:50,155 --> 00:50:52,130 Does that make sense? 881 00:50:52,130 --> 00:50:52,640 OK. 882 00:50:52,640 --> 00:50:56,060 So that's how we can sort in linear time, 883 00:50:56,060 --> 00:51:00,920 if our things are only polynomially large. 884 00:51:00,920 --> 00:51:04,100 So in counting sort, we get n plus u. 885 00:51:04,100 --> 00:51:07,910 In radix sort, we get also a stable sorting algorithm 886 00:51:07,910 --> 00:51:14,140 where the running time is n plus n times log base n of u. 887 00:51:14,140 --> 00:51:16,700 Does that make sense? 888 00:51:16,700 --> 00:51:20,855 And then, in the situations where-- 889 00:51:20,855 --> 00:51:22,480 there's a typo there in counting sort-- 890 00:51:22,480 --> 00:51:25,270 that should be when u is order n-- 891 00:51:25,270 --> 00:51:27,490 counting sort runs in linear time. 892 00:51:27,490 --> 00:51:31,510 And it's linear time also in the case of radix sort, 893 00:51:31,510 --> 00:51:35,380 if our things are bounded by a polynomial in n, 894 00:51:35,380 --> 00:51:38,770 by n to the c for some constant c. 895 00:51:38,770 --> 00:51:41,180 Does that make sense? 896 00:51:41,180 --> 00:51:45,350 All right, so that's how to sort in linear time, 897 00:51:45,350 --> 00:51:47,690 with the caveat that your numbers aren't too big. 898 00:51:47,690 --> 00:51:51,700 OK, see you next week.
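For reference, the radix sort summarized in this lecture (base-n digits, one stable counting pass per digit, least significant first) can be sketched as below. This is a hedged illustration rather than the course's official code: the function name is invented here, the keys are assumed to be non-negative integers, and digits are extracted on the fly instead of building the tuples up front.

```python
def radix_sort(keys):
    """Sort non-negative integers via base-n digits and stable counting passes."""
    n = len(keys)
    if n < 2:
        return list(keys)
    base = n                         # n >= 2 here, so each digit strips a factor of n
    # Number of base-n digits needed for the largest key.
    digits, biggest = 1, max(keys)
    while biggest >= base:
        biggest //= base
        digits += 1
    out = list(keys)
    place = 1                        # base ** (index of the current digit)
    for _ in range(digits):          # least significant digit first
        chains = [[] for _ in range(base)]
        for k in out:
            chains[(k // place) % base].append(k)   # stable: append in order
        out = [k for chain in chains for k in chain]
        place *= base
    return out

print(radix_sort([17, 3, 24, 22, 12]))  # [3, 12, 17, 22, 24]
```

Each pass costs O(n) because the digit range is n, and the number of passes is the number of base-n digits of the largest key, which is constant when the keys are bounded by n to the c.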