This session focuses on preparing for the quiz. High-level concepts are discussed and example problems are worked. Topics include the Master Theorem, data structures, and sorting.
Instructor: Jason Ku
JASON KU: All right. Welcome, everybody. Everyone ready for the quiz?
JASON KU: Quiz next week. Yes. I hope you all are here because you know that there's a quiz next week. OK. So what is this quiz about? It's about what we've talked about so far in this class of course. What is this class about? Someone remember from my first lecture? What is this class about? What are we trying to test you in this class?
JASON KU: Algorithms. Great. Also data structures. Right? That's what the first part of this is. But really it's to get you to solve computational problems. That's the first thing. Be able to argue to someone else that you actually did solve it. Right? It's correct. Right? That you chose something that's better than other things, that it's efficient. Right? And that you can communicate those things to other people. Right?
Those are my big four that I try to get you guys to internalize. And so that's what our quizzes are going to try to evaluate you on. OK. And so aside from some nitty gritty kind of stuff that we do at the beginning of the term like talking about our model of computation. Right? Our model-- and asymptotics. Asym-- sym-- totics. Is that right? Recurrences right?
Aside from these kind of basics, we delve straight into algorithms. Right? These are kind of like almost definitions. We don't rely on these things very much. I mean, we rely on these things all the time. But it's kind of the mathematics that we use to talk about things. Right? How can we even say how long this stuff takes unless we can reason-- we can abstract a way that this stuff is not on a real computer.
This is kind of in our minds, in a computer. And we're reasoning about these things based on the number of constant time operations this magical computer might have, which is a pretty good representation of any of the computers you have for certain assumptions. Right?
We're not-- I mean, there's not a lot of problem set questions we had you talk about. These things-- specifically-- usually they were part of some other problem. Right? You had to describe the running time of this thing. And you might have had to solve a recurrence. And you might have uses master theorem. That kind of thing. Right? Or you use asymptotics all the time. Or you need to remember in our model, oh, it kind of matters how big of an integer I can store and do arithmetic on in constant time. Right?
So for that p-set 3 question you had at the end of the coding question, you wanted to hash things. And you need to argue that those things fit in a constant number of words so that could be done in constant time. A lot of you guys found canonicalizations basically mapping to things that were exponentially large. So maybe multiplying a product of primes or something like that.
And that would not be a good representation. OK. So these things come up, but they're not the main focus of the problems we solve. What are the main focus of the-- how do we solve a computational problem in this class? I gave you two ways at the beginning of the term. Do you guys remember? We can solve-- how to solve a computational-- computational problem.
One's the hard way. One's the easy way.
AUDIENCE: Brute force.
JASON KU: Brute force. OK. So you're describing to me a technique for making your own algorithm. Right? I can just design my new algorithm from scratch. One way I could do that is brute force it. Right? Look at all the possible outputs and see which one works. Right? Or I could reduce to something like divide and conquer or something like that.
That's generally a hard thing to do-- to make your own algorithm. Right? That's why we don't ask you to do it a lot in this class. It's an 046 kind of thing. Right? So the first thing you could do is write design a new algorithm from scratch. Usually it's not from scratch. Right? Usually you're reducing to some kind of algorithmic design paradigm that you've maybe heard of.
You'll talk a lot more about it in 046 and at the end of this term when we talk about dynamic programming. But generally, that's a hard thing to do. Right? You're trying to think of a recursive algorithm. Right? You're trying to prove that it's correct-- all these nitty gritty-- we have actually, throughout the class in lectures, have been showing you the algorithms. But we're not really expecting you to make those algorithms.
What are we expecting you to do most of the time. Yeah? To reduce it to a problem that we showed you how to solve. Right? I'm going to say thing here, but really what I mean is an algorithm that we've taught you to-- basically here-- reduced to a known thing-- to known thing. Usually that means there's a problem or an interface that we've given you.
And in general, we've shown you multiple different ways to solve that problem or interface. Right? So we've shown you many ways on how to sort things. And we've shown you many ways on how to implement sequence and set interfaces. Remember?
And a lot of times the types of problems that we're asking you to do is to just use as a black box some of the things that we did. But you need, as a programmer, as a computer scientist, you need to tell me when I should use what went. Right? Yeah?
AUDIENCE: Could you clarify what you mean by use as a black box?
JASON KU: Use as a black box. Exactly. Right. So this is a phrase that I use and a lot of people in computer science use. It's basically you import a library into your code. Right? What do I have? I have an API. I have a way to interact with that code. I don't actually know what's going on inside of that library. I'm using it as a black box. It's opaque to me. I cannot look inside-- I can actually probably could look inside what their code is, but I'm not going to.
The thing that makes it useful to me is that has this useful API that I trust it to do the things that you told me that it was going to do. Right? And so there's kind of-- I'm going to jump around a little bit here actually, because that's a great question. So here I think of there's three different types of problems that we talk about, that we give you in this class. You might have seen this on the problem set. Right?
I like to categorize them into three different categories here. One is you have to understand the internals of a data structure, an algorithm that we know. You have to be able to look inside and, I don't know, given a node in a balanced binary search tree-- an AVL tree-- how can I do a rotation? Right? Or how do I do an insert? Or something about the structure of this thing-- a binary heap-- where are the top k things in a max binary heap, which is on your problem set.
Those things require me to very much not black box these data structures. It's a white box. Right? I need to know what's inside of that to answer that question. Right? I need to know about the internals of that data structure. Right?
And there are other types of problems where it's like, oh, I don't need to know what the internals of this data structure is. I can just operate with knowledge of the API and try to hook it in to the problem that I need. And that's what I call a reduction type problem. This is how does the core material we presented to you in lecture work? This is how do I apply that core material?
And harder than both of those things is what I might call a modification type of-- these aren't really good names. I came up with these this morning. But it's trying to get at the idea here that it's possible that you need to know what the API is, and you need to know what's going on inside to be able to answer the problem. Things like adapting a divide and conquer algorithm.
Or making-- instead of using a dynamic array that has extra space on one end, maybe I have to put extra space in the middle or something like that. Right? I'm adapting something that was from the core material. It's pretty close. But I have to modify it in some way. Augmentation right?
I have to take the vanilla said AVL tree that I might have given you and put some other property on the nodes and you need to tell me how to maintain that. How can I compute that subtree property from its children? Does that make sense? So this is the harder of the things. Right? If you can identify which one of these-- a problem that you look at on an exam-- [INAUDIBLE] under-- maybe that can help you conceptualize what should I use.
For a reduction type problem-- I'm going to talk about this in a second-- but a lot of times it's useful to reduce it to a problem or an interface rather than an algorithm or a data structure. What does that mean? If I can solve the problem by saying reducing to sorting, I can argue to you that that algorithm is correct. I just use sorting as a black box. Now it might not be efficient.
My choice of sorting algorithm that I chose matters for efficiency. But for correctness, it doesn't matter. Right? For a data structures problem I might reduce to using two set data structures and sequence data structure or something like that. But it will be correct if I reduce it to those things. I can define the operations in terms of those interfaces. I don't have to make that choice until I talk about running time. Right? Until I talk about efficiency.
And the name of the game on the quiz to get points-- we can't give you points for an incorrect algorithm or something that's pretty close to correct. Right? And we can't give you full points unless the correct algorithm you give us is efficient and that you've argued things like correctness and running time and things like that. Your algorithm could be correct and efficient, but you analyzed the running time incorrectly so we mark you points off for there.
Or most of the time what you do is you present us with an inefficient algorithm. And then you analyze the running time as if it's the target running time that we gave you. Right? That's bad on two fronts. Right? OK. So try not to fall into these traps. OK.
So some general test taking strategies when you're looking at your quiz. I really strongly urge you to read through the entire exam before you start because some of the problems will be easier for you than others. And if you're trying to maximize points on here as all of you, I'm sure, are trying to do, it's useful to make that initial pass through the problems to see which ones are easiest for you.
And then you can tackle them in the order in which you have confidence. Now in actuality, the average on say quiz one of this class tends to be around, I don't know, between 60 and 80. I don't think it's ever been 80. But it's not 100. So doing 50% of the problems well is probably going to be better for you in terms of time management and those kinds of things than doing all of the-- attempting all of the problems and not doing great on any of them in terms of point-- you have to, in computer science, you have to be pretty close to a correct answer to get points. Right?
It basically needs to be almost correct or you don't get-- if you've seen how your problem sets are being graded-- sometimes our problem set graders make mistakes. Sometimes they give you points for an incorrect solution. Right? It's really on you to take a look at your problem sets that you gave us and our solutions that we gave you. We spent a lot of time writing good solutions for you guys.
You need to make sure that the material. Don't come up to us at the end of an exam and say, oh, I said the same thing on my problem set. It was marked correct and you guys marked it wrong. Well, yeah. The staff knows a little bit more about algorithms than your graders on your problem sets. And we grade your exams. So unfortunately, that's not an excuse. It's on you to know the material. Yep?
AUDIENCE: So are most of the problems on the exams multiple part things where we need to have a good understanding and do the first few parts in order to get even partial credit on the other parts? Or is it like [INAUDIBLE]?
JASON KU: Yeah. So the question is, are questions built on top of each other so that you're kind of at a wall if you missed the first part? We try not to design exams that way. OK. You can actually take a look at the practice exam that's already been posted. Our problems tend to be self-contained. And if they are multiple parts, the parts are usually independent.
Usually you don't need to have done A correctly in order to do B correctly. All right? And that's how our problem sets try to be written as well. For your last coding question and 4 you had this problem where you had to design this data structure. But C said use that as a black box essentially and solve the problem. Right?
So you actually don't need to show-- we've given you this interface. You can just use that interface to be able to answer C, the algorithms question, without even solving the data structures question correctly. Does that make sense? And actually, the algorithms question was the easier one there, I think, from what I remember. Yeah?
AUDIENCE: Do we have to write code on the exam?
JASON KU: Do you have to write code on the exam? I've never given an exam where you've had to write code. I have written exams where you have to read code. OK? So pseudocode or Python-- since Python's a prerequisite for this class, it's completely fair game that we give you small snippets of Python code and you have to be able to understand what's going on. Yeah?
AUDIENCE: You listed amortization under the modification [INAUDIBLE]. Does that mean we're using an amortized [INAUDIBLE]?
JASON KU: Yeah. What I mean here-- amortization certainly appears in here or here-- these kinds of things. Oftentimes if I'm using a dynamic array, if I'm using a binary heap, if I'm using a hash table, amortization will appear here in our running times for those dynamic operations. What I mean here in amortization-- I mean if I'm asking you to generalize something that we've done like with dynamic arrays where instead of adding additional space at the end, I'm putting additional space in the middle or at the beginning or something like that.
And you're having to do some kind of amortized analysis. Now, often it's unnecessary to do this in-- when we talked about a problem where we did do our own amoritized analysis and making a double ended deck-- I mean a double ended queue-- you could actually solve it by reducing to using two dynamic arrays. Right? So there's a lot of ways in which you could reduce to using things. But you might have to do some additional bookkeeping at the end.
But what this is saying is that these are more-- you're not using things as a black box. You're changing something about the boxes that we gave you. Right? Does that make sense. Yeah?
AUDIENCE: If you tell us to write an algorithm that does something like O(log n) time, and we can think only of an algorithm that does something inefficiently like O(n) at end time.
JASON KU: OK.
AUDIENCE: Then is there any point writing that?
JASON KU: Sure. So let's actually move on for a second. I actually-- I'm going to answer your question very soon. OK. But I'm going to get to it in a second. OK? If I don't answer that question in five minutes, please let me know. OK? So the first thing, when I'm approaching a problem on the exam, I might try to ask some questions about the problem. OK.
It's going to help me decide what to use. Right? Different than your problems sets. Your problem sets-- basically, what do you use is what did we talk about in lecture that week. On a quiz you have eight lectures that you've talked about. And so this is going to be a harder thing for you to do because you don't know which of the eight lecture material is going to apply to this problem. And it could be a combination of them actually.
And so I'm trying to give you ways of answering that question faster. OK? So is this a mechanical reduction or a modification type problem? That's just going to help me determine the difficulty level of what this is. You might not be able to answer it. But it can give you a sense for what kind of problem it is.
Is this a problem about data structures sorting both? Right? If it's about data structures, do I need to support sequence type operations or do I need to store an extrinsic order on something? Or is it a thing where I care about what the objects are? I'm trying to look things up by what they are.
Or maybe both? Or maybe some combination? If I have a bunch of different types of keys that I might want a query on, I might have to use at least two set type of data structures. Right? You could get very complicated with these things. But putting it in terms of well, I'm going to need to do this kind of operation on these names. Then I can think, oh, I need a set data structure there. I'll think about how to implement that later.
Should I use a hash table? Should I use a sorted array? Should I use a AVL tree? But thinking about it first at the abstract level of I need a set data structure here can help you compartmentalize correctness versus efficiency. Does that make sense? OK. If you're stuck, this is your question. If you're stuck, write down a correct algorithm that's inefficient.
We can give you points for a correct algorithm that's inefficient. At least it's a correct algorithm. That's better than other things. Right? Now, if it's exponential time, you might be limited to 10% or 20% of the points. But if it's a log factor-- worse, or linear factor-- worse. Maybe that's OK. On a data structures problem, if any operation takes order n time, that's probably not going to give you a lot of points because the whole point of the data structure is to make those operations fast.
But if it solves the problem, you'll get some points. You won't get 0 points. Yeah?
AUDIENCE: Will we get any questions that are like how fast can you make this? Make this as fast as possible.
JASON KU: Yeah. So a lot of times we'll say, give us an efficient algorithm. OK. It's like, whoa, I don't know if it's efficient or not. Well, that just means that faster running times are going to give you more points. OK? So in questions like that, it's mostly trying to play this game of-- usually, we'll put an efficient one in not in terms of a data structure because usually, the data structures problem-- it's important in your implementation that these data structure operations be fast.
And we want to tell you how fast is. Right? So data structures questions in general, there's usually a trade-off between running times of these different operations. And it's really important how they relate to each other. And so for data structures problem, it's kind of about getting those running times. OK?
With an algorithms problem where we ask you to do one thing and we try to do it as fast as possible, try to get linear time. Right? Most of the time you can't get better than linear time if you have to read the entire input at some point if I want to find the things in my data. And if you can't think of a linear time algorithm, think of an n squared thing or think of n log n thing. Right?
Maybe that's a little hard for you guys to think of right now. But that's why I'm saying start with any correct algorithm and then maybe you can optimize. Maybe you can use a better data structure to make it more efficient. Does that make sense? Any other questions? OK. Moving right along. OK.
Here's some downsides. OK. If you find yourself doing one of these things three things, take a step back. You're probably doing something wrong. OK? So question yourself if you're trying to compute decimals, rationals, or real numbers. I can't store those things on a-- I mean, I can store decimals to finite precision. But if you're doing finite precision, you might as well multiply your numbers by that fixed precision and deal with the integers. Right?
We only have taught you how to deal with the integers in this class. Right? We haven't even shown you how to efficiently compute on rationals and real numbers. We have told you if you have a denominator and a numerator of a fraction, I can take two fractions and compare them right in constant time by doing cross multiplication. But if I'm trying to actually do this division to arbitrary precision, that's not happy because I can't even represent that on my computer in a finite number of decimal points for some of these things.
If you're trying to use Radix sort for every answer, it's probably wrong. One of the things that we try to do on our quizzes-- we're not just giving you a bunch of problems randomly. We probably are making problems that cover the material in some way. We do want to test you on all of the things. And so if you find that you're using the same thing four or five times on the exam, that might be a sign that you're using it too many times.
It's not always the case. Right? Sometimes hashing is super useful, so you want to use it all the time. But in particular, everyone tries to use Radix sort when it's inappropriate. And they love because it gets linear time. Right? But if you write merge sort for something where Radix sort will apply, you'll get some points because it's correct but not efficient. And it's inefficient by a log factor. Right?
If you're trying to use Radix sort in a situation where comparisons are the answer and you don't have a bound on the integers, if I don't have a bound on the size of the integers, then this may be taking a huge amount of time. So I might not even think of that as being correct because it could be exponential time. I mean, I don't know. I don't know how big my word size is. It could be arbitrarily bad. OK.
And then if you're trying to augment a binary tree with something that's not a subtree property-- something that can't be computed from the augmentations of its two children-- you're doing something bad. Every exam we have 30% of students say augment by my index in the entire tree or augment by-- here's one that's fun-- augment by the size of my left subtree-- the number of nodes in my left subtree.
How can I-- I'm not sure how I can maintain that with rotations and things like that. Let's see. In order for me to keep track of the augmentation of my left tree from the augmentation of my left tree, I have to do a logarithmic walk all the way down the thing to figure out how many things were there. So that's not a maintainable thing in constant time.
If I want to augment something of my left subtree, just augment the thing itself and just look at your left subtree and look at its augmentation. Does that make sense? Yeah?
AUDIENCE: If you had say, augmented it with a subtree and then augmented it again with [INAUDIBLE] subtree and then use that--
JASON KU: You could do that.
AUDIENCE: --does that still count as--
JASON KU: You could do that. Yeah. So you could do that in constant time by augmenting by subtree size. Right?
AUDIENCE: And then have another augmentation in [INAUDIBLE].
JASON KU: You could have another augmentation because then you could just look at-- but then just look at your left subtree please. Yeah. No reason to store it again. Right? You just do one constant time-- look to your left. OK. Don't do that. OK. So that is-- cool. I'm not that far behind. Those are my tips on solving questions. Oh, there's one more page. Yeah?
AUDIENCE: So when defining an augmentation, [INAUDIBLE] do the formula and argue that it's [INAUDIBLE].
JASON KU: Exactly. Right. So the idea is if you give an augmentation that's not a-- we're going to talk about our standard things you can reduce to in a second. If you're saying, I'm going to take a set AVL tree or a sequence AVL tree-- and for example, at the end of lecture, we were talking about how a sequence AVL tree could be modified to support priority queue operations in the same running times as binary heaps. Right?
And that was saying we store our subtree maxes-- the max thing in my subtree. Right? And so that's a different augmentation than what's already augmented on the sequence AVL tree. What are the augmentations on a sequence AVL tree?
JASON KU: Size.
JASON KU: So counts is the same thing as how many nodes are in my subtree. And?
JASON KU: Height. Right because it's an AVL tree. Right? So if I'm augmenting by max in my subtree, that's not part of my standard interface. So you need to tell me that. Even though we've done it before, it should be very easy for you. Just say, I'm augmenting by my max. Max can be computed as the max between me and my left and right subtree if they exist. Done. Right? But just do that.
You have to tell me-- and it takes constant time, so it can be maintained at constant time when I'm doing my stuff. Does that makes sense?
OK. So the last thing, I guess especially on the data structures problems, I would suggest that you approach these things by solving these problems just in terms of the interfaces first. Because then at least you get something that's correct. And then choose the algorithms or data structures that you use to implement those interfaces afterwards.
One gets you to a correct algorithm. The other is for efficiency. Decoupling these might help you in solving the problem. If it doesn't help you, don't. If you're like, whenever I see a set data structure, I'm going to probably use a hash table, that's probably fine. But if we're looking for worst case time bounds, that's probably not fun.
So you just-- I'm suggesting that you separate these things so that you concentrate on solving the problem first and then optimize it later. Yeah?
AUDIENCE: Just a question in regards to worst case time bounds.
JASON KU: Mm-hmm.
AUDIENCE: So for a hash table.
JASON KU: Mm-hmm.
AUDIENCE: Given that's a code one expected.
JASON KU: Mm-hmm.
AUDIENCE: That also implies that's O of n worst case. So could you --technically
JASON KU: It doesn't imply that. It is that. So, I mean, you could have a data structure whose expected time bound is constant, but it's a worst case bound is n log n. It just happens to be the fact that for a hash table, those worst case operations are linear. But if I --had we had a question up here beforehand, if I had a running time bound that I did something to a hash table in constant expected time, I did a look up, and then I queried an a AVL tree for the predecessor of a node or something like that, and I did that in O of log n time, what's the worst case running time of that?
AUDIENCE: O of n.
JASON KU: O of n. What's the expected running time of this thing?
AUDIENCE: Log n.
JASON KU: Log n. Expected log n. Because it's possible that in the worst case it could be higher. Does that makes sense? OK. --so OK. The second bullet is just setting up a data structures problem. There's a lot of moving parts. We're going to do two data structures problem at the end of this session.
Describe all of the data structures you're using, including what they store. If you're storing a set data structure, you better tell me what the things you're stored are keyed on. Usually the things that we're storing contain a bunch of information, and if you just say I'm storing all of the toppings of my pizza in the set data structure, and that's all you tell me, I have no idea what you're talking about because I don't know what the semantics of your data structure are. What is it keyed on? I have to say, oh, it's keyed on, I don't know the y's or something like that.
OK. And there are invariants. How we're setting up these data structures problem, usually how I solve these when I'm writing the solutions, I set up a state of what this data structure could be at some instance. I'm going to say, this data structure stores all of the things less than --k with key less than k, blah, blah, blah. And this one stores the extrinsic order of the items based on blah, blah, blah. OK.
So actually, me stating what they store in that way is actually imposing some kind of invariant on these data structures that I'm wanting to maintain. But what I need to do to prove that this thing is correct is that based on the assumption that those invariants held before my operation. Then I can prove that an operation is correct if all of those invariants are maintained before, then after the operation. That's kind of how I'm proving that this thing is correct.
And then when I'm querying, I'm doing some kind of lookup on this data structure, I can rely on those invariants. I know that those things are good, those things have been maintained, and so I can rely on those to look up what's the largest k of this thing. Does that make sense? If this is very abstract, we're going to get a little bit more concrete in just a second.
And then implement every operation. You have no idea how many solutions we read on the quizzes, which we give you three operations to implement and you don't even mention one. And it's usually the easiest one. It's like inserted into your data structure. It's like, come on, just say that. We can't give you points unless you mention that operation. Does that make sense?
And then it's going to help --us happy graders give you more points. Not really, but if your solution is well-organized and well-labeled and things like that, then we're going to be able to comprehend your solution better and we'll be able to give you more points. Remember, part of this class is about communication. If your thing is correct but we can't tell what you're saying, then it's not correct. OK. All right. So now we get --to any questions on that? Yeah.
AUDIENCE: Just a question on invariants. So one of the data structures that we've discussed in the class previously to state like what the invariants are, like the set AVL tree rule.
JASON KU: Right. So if it's the standard things, we were going to talk about what the standard things are now. Then you don't need to re-argue or --restate I mean, you can basically say like, because the set and sequence interfaces are defined that way, these things are correct, like --almost you can --basically you're trying to convince us that why it's correct.
If you're correctly using a set or sequence data structure in these data structures-type problems, then unless you're using it in a way that is unusual, usually you can just rely on the properties of the set and sequence data structures that we gave you. I do want you to mention that you thought about correctness. Like this data structure is correct because.
Just write a sentence saying that and arguing that you're basically maintaining the invariance of your data structure kind of at the upper level. What kinds of things is this data structure storing? What things about the global data structure are we relying on to make query operations? As long as you are convincing that after a dynamic operation when modifying the data structure that those invariant stay the --same that are still satisfied, then that's really all you need to say.
These invariants are satisfied because of the definitions of a set and sequence data structure. A lot of times it doesn't require a lot of thought for --the we're not --asking the reason why we do reduction problems is so you don't have to do a lot of work to prove to us that it's correct. We have these really nice black boxes. They are correct. We proved them to you that they're correct, and so you don't have to redo that work.
So now we're going to go through the core material I like to think about in this class. The first part of this class, aside from these mathematical tools that we developed at the beginning of the course, it's mostly about solving problems involving data structures. And we motivated this problem of sorting by saying that a sorted array is a data structure that's actually pretty useful. But how do we sort those things? Well, we showed you a bunch of ways to do that.
And this is that nice table. Lots of stuff. Why do we show you so many sorting algorithms? Why don't we just give you one algorithm? Hmm? Yeah?
AUDIENCE: --run times.
JASON KU: Different run times.
AUDIENCE: Better for --different
JASON KU: Better for different scenarios.
AUDIENCE: Yeah. They each have their own individual points, then they do better.
JASON KU: Yeah. You'll notice in this table, there's not blue all the way across for any of these things. So some of them are better for different scenarios. And actually, these comments list some special cases where these things might be better. In actuality, this ultra-blue thing is saying like this could be linear time. That's better. But in some cases this is worse than all the other things.
So be a little wary of this blue color here. Generally we're trying to get you further down in this chart if you can. And in general, like for example, merge sort, AVL sort, these are really the same in terms of asymptotic complexity and the way in which you interact with these sorting rooms.
But there are special cases where you might use insertion sort or selection --sort actually, I'm not sure about insertion sort. You had in your recitation two days ago-- I think you guys showed how to do if you had a k-proximate array where things are not more than k away from each other. Insertion sort actually runs in n times k, and so if k is small, then that's really good, that's kind of like linear. But you can actually do even better with a binary heap, which you saw in recitation, hopefully, where you can get that down to n log k by --keeping maintaining a heap as you go across and finding the max that way.
So insertion sort maybe is not so great. But selection sort, the name of the game there is that if my reads are cheap but my writes are expensive, selection sort it actually does pretty well, because I only have-- sorry. Reads are cheap and my writes are expensive. Selection sort does a linear number of swaps. It's looking down, finding the max, swapping it in, and keep it going. And so in such cases, that's actually better than any of these other algorithms that we've got. Yeah?
AUDIENCE: Those heap --sorts
JASON KU: Mm-hmm.
AUDIENCE: --time, worst case are expected.
JASON KU: This is worst case. So it's a little hard to-- there was a lot of moving parts to get to this bound in Tuesday's lecture. Basically what we did, we showed you how to think of an array as a heap-- as a binary tree. Complete binary tree. It's not an AVL-- I mean, it is an AVL tree, but AVL trees are weaker than a complete tree. Height balance is a weaker property than a complete.
The reason why we use complete is because it's unique for a number of nodes. That way, when I give you one array with a fixed length, I know exactly what tree you're talking about, because there's a one-to-one mapping there. If there was some ambiguity on what tree I was talking about, I --wouldn't heaps just wouldn't work.
So what heap sort does is it has this correspondence between arrays and binary trees. And then what it does is it provides these operations that kind of only do operations at the end. And then the in-place optimization is that, well, instead of actually popping it or pushing it onto the back of an array, I'm just going to think of a subset of my array as a heap and then always kind of pooping the max out to the end. Just leaving it behind and thinking of my heap as a smaller subset.
And that's how we --got it didn't actually de-use any amortization. For the time bound, time bounds you could actually do this with the amortized basically dynamic array version of this. The time bound doesn't rely on that. The in-place relies on keys staying all within one array. Does that make sense? You're doing a bunch of amortized operations, so that this actually does achieve worst case n log n. Yeah?
AUDIENCE: It seems that the things we learned--
JASON KU: Mm-hmm.
AUDIENCE: --not like--
JASON KU: Mm-hmm.
AUDIENCE: They tend to be better. So assume that most algorithms than we want them to be--
JASON KU: Uh huh.
AUDIENCE: --faster. It seems like people will not use as often, especially in these [INAUDIBLE] insertion and selection.
JASON KU: Right, yeah. So there are these special cases where they're good, but, I mean, generally these are better general data structures for most situations that you come across. I mean, there are cases where those other things are good, so you don't want to completely ignore them, but generally, yeah, you're trying to be lower bound in this chart.
AUDIENCE: What I mean is that if I don't have any --bound like I'm using them all except for selection sort--
JASON KU: So there's too many things on here for us to test all of them on an exam. So don't be afraid if not everything is covered. Worry when things are covered 18 times on the exam, that's not good. OK. So for radix sort, there are situations where you get linear time. It's when you're polynomial-bounded. Are there times when I want to use radix sort when I'm not polynomial-bounded in my integers?
AUDIENCE: Well, if you're not polynomial-bounded, then that could take a really long time.
JASON KU: Sure, yeah. But like where's --the this will be worse than n log n when? It's definitely better than n log n when u is polynomially bounded. Because it's linear. Yeah?
AUDIENCE: [INAUDIBLE] n to the n.
JASON KU: n to the n. OK. So if I put n to the n in here, I get an n factor that comes out here. That gives me a quadratic running time, which is not great. But when will this be better then n log n? Yeah?
AUDIENCE: n to the c, because it's less n to the c.
JASON KU: n to the c-- so that'll definitely give us linear time. That's what this is saying. But we can actually do better than n log n if this u is n to the c times log n for some c. If --it's like if it's n to the c log log n, that's smaller than log n. So that's a better algorithm. That's a faster algorithm.
Do you guys see why we wrote these things this way? It's so that we give you a more --precise this is more important for you understand what u means here than that this thing sometimes runs in linear time. We want to know when it runs in linear time. Does that makes sense? Or when it runs faster than merge sort. Does that makes sense?
OK. So that's sorting. We have sequence-type data structures. We have linked lists, we have dynamic arrays, we have sequence AVLs. Sequence AVLs are great. I don't know why no one teaches it. They're great. They don't teach it probably because they're actually not that useful. You don't actually use the insert in the middle a lot in coding.
So it's-- you can usually get around with like shifting something to the end and doing dynamic operations there, and that's a lot of the games that you try to play so you don't have to make your own data structures, and you can just use the Python list that's in your-- like native to your thing. But it has a theoretical interest because it gets these kind of balanced bounds if you need to insert in the middle of this sequence.
Now some of you look at me and had a question before that was like, but Jason, how does linked list operations on the end of a linked list, why does that take linear time? It's because in lecture, we presented you with what? A singly-linked list. It just had pointers to the next thing.
And if I only have pointers to the next thing and a pointer to the head, for me to find the end, I need to walk all the way down the list. Now let's say I keep my pointer to the tail, then finding that and n1 is fine, but removing it still takes linear time, because I don't know what came before me. And so that's why in p-set whatever, 1, 2, I don't remember, you stored a pointer back to your previous one that gave you doubly-linked lists. Yay, right?
So actually, expanding this table out, you can reference a doubly-linked list here and get this one as constant time. Does that make sense? And this one's still linear time. OK. But this one's still linear time. We actually also showed you how to do this in constant amortized. Do you guys remember that? That was in problem session 2 or 1? I don't remember. It was whatever we talking about amortized stuff.
We --got we got actually both of these to one amortized using the concepts of the dynamic array. And then we actually did it one more time where we got a good doubly-ended thing. What was that? Does anyone remember? We went to problem session 3. Yeah?
AUDIENCE: Oh. I'm probably wrong, this is--
JASON KU: OK.
AUDIENCE: --problem session 3. But like q dq type stuff?
JASON KU: So q dq, those are talking about doubly-ended things. Those are implemented in a certain way, which is actually one of these things. It's actually, I think, this one in Python. But there's --a we used a different data structure to get something that had really good at. It had really good pop and append and first and last dynamic operations. Do you --guys anyone remember from session 3?
AUDIENCE: Isn't it dynamic array [INAUDIBLE]?
JASON KU: That's what this was. OK. We got one that had expected bounds. Does that help? What? I heard someone say it. Hash table. Yeah. So basically what you did, instead --of this is a sequence thing. These things don't have keys. But I could identify with each item as I stick it in, a key representing its index. And I could use a hash table that way.
Now there was some difficulty if I removed the first thing. Because actually, how --those are all of my indices have now changed. But if I just store what the smallest index in my thing is, then I'm golden, because I can compute what that index should be I change things at the front.
OK. So there's actually three different ways we showed you of getting constant time at the front and the back of this thing. So actually, you can think of that as standard material that you can reduce to. That's the one exception of the things that I'm not showing on this chart, but is a bonus if you're watching this problem session. Yeah?
AUDIENCE: So isn't a hash table a set data structure, though?
JASON KU: It is. But we used it to implement a sequenced data structure. So I will refer you to that problem session if you want to learn more about it. So, any questions on sequenced data structures? Yeah?
JASON KU: Yeah, uh huh.
AUDIENCE: Should we be-- or-- yeah. Should we be able to look prove the tables that we're given?
JASON KU: So I would hope-- hope-- that if I gave you a blank table, that you would be able to fill it out. That's how well I want you to know how these things are implemented. We're not going to have that on the exam. That's a boring kind of question.
AUDIENCE: But it is important to know?
JASON KU: Yeah. I would-- it's good for you to think, like, oh, if I'm going to use an AVL tree, operations are generally going to be log n. That's a really useful thing-- or, if I'm going to use a hash table, dictionary-type operations, finding, inserting, and deleting, those are fast. Doing order-type operations on hash tables is bad, that's hard to do because I have to basically look through all my things. Knowing that dynamic operations on a sorted array is bad.
Or knowing that-- you have to think about here what we mean when we say linked list and dynamic array in this table, because--
JASON KU: Yeah, exactly. We are implying singly-linked here because that's what we presented to you in lecture. And so those are the standard things that we want you to reduce to. Yeah?
AUDIENCE: --three modified standard things are the doubly-linked lists, the doubly-ended--
JASON KU: Yeah. Basically you can assume that you have-- it's this one that you probably want to use, because you get constant indexing. And then pretty good on both ends. But if you need a double-ended Q, you can also reduce it to having two dynamic arrays back-to-back when going this way, when going this way.
And so there's lots of-- we're not going to give that to you as a standard method because there's literally like four methods we showed you how to do. So you choose one. OK. Yeah?
AUDIENCE: In the interest of saving time on the exam, if we want to say, like, we do this thing with the sequence AVL and it takes log n time. Do we have to say the sentence like, because this is sequence AVL, it takes a longer time because this? Or can we just like say like, for the table that we know?
JASON KU: Yeah, no. If you told me that you're storing things in a sequence AVL and you just say-- you basically say what you do to it, and you say, which takes blah, blah, blah time, you don't need to say because it's a sequence AVL, because you already told us it's a sequence AVL. I believe that you probably wrote the chart on your cheat sheet and you looked it up. All right. Any other questions on this? No. Yeah?
AUDIENCE: Just make sure, the doubly-linked lists will make the insert on [INAUDIBLE]. Is that the only change?
JASON KU: Yes. That doubly-linked list gives this guy constant time. And actually, there are two operations here, insert, delete, and-- I guess there's a find here as well. If I just store the tail pointer, that gets defined to constant time. but it doesn't get the dynamic ones to constant time. I need to store the previous pointers as well at each node. Does that make sense? OK.
Lastly-- or I guess second to lastly, set data structures. There are a little bit more of these. Not so many more. But yeah. We had a sorted array, gets good find, but is not dynamic. We set AVL tree which does pretty good find and is dynamic. Again, you get this n log n overhead to build, because essentially what you're doing with both of those data structures is to sort.
But if I'm looking on a theory question, like I'm not asking you something specifically about a sorted array, if you have a choice between this data structure and this data structure, which one do you choose? Well, I don't quite know, because it's almost like this one's better at everything except for this one. But can anyone tell me how to make this one constant time as well?
Augmentation. I could just store in my subtrees the max or min. And I can make this one strictly better than this one. So in theory problem, you probably just want to use this one. Here, hash tables, direct access arrays, even better for these operations. That's great. But they suck at these. So if you need these, don't use these.
And in actual coding, especially if you're coding in a language that's not Python, something that doesn't automatically give you a hash table, if you're in C in a microcontroller lab at MIT here, you're taking 6115 or whatever, and you're doing assembly, usually what you're doing is a direct access stuff. Because that's giving you the jumps that you need in machine language to actually go and access this in constant time. That's generally if you have control over the keys that you're putting-- that you are putting in your data structure, you don't want this overhead of running your keys through a hash function to look these things up. You just store the things in an array.
You use the hash table when you don't have control over the keys or your keys are like strings or something. So that's when you use the hash table. Now for our purposes, usually a hash table is just as good unless we're asking you for worst case bounds. And we're going to do this when we talk about the data structures problem. If we give you a situation where we don't care if you achieve worst case are expected or amortized or any of these things, we'll just say, make sure you state which one you achieve. And as long as you analyzed it correctly with respect to your data structures, then you're fine.
But if we say you better do worst case here, I'm going to slap you around. Then please get those bounds. Don't use the hash table in that case. That makes sense? OK. Lastly, we've got priority queues which we talked about. I'm not going to go through this one very much. It's basically just adding this to the thing. But in actuality, you can get all of these bounds with a set AVL-- I mean a sequence AVL tree with max or min augmentation, which isn't on this list because we didn't really talk about it, but hopefully you can-- if you need these bounds without the amortization, then you could achieve them.
All right. So that's basically everything that we've talked about in the class. We're going to spend the rest of the time working a couple of data structures problem. I'm not-- there's a number of different types of questions you'll actually see on the quiz-- the practice quiz that we gave you from last term. There are some what I call mechanical-type questions up at the front. Then usually some reduction-type problems where you're reducing to using some sorting algorithms or some data structures.
And then usually the latter ones are ones where you have to do some kind of something additional, like some augmentation or some divide and conquer or something like that. OK. So we're going to go ahead and spend the rest of the time working a couple of these problems. These were from spring of 2019 on the exam. And actually, one of the TAs who's TAing for us now was also TAing for us in spring 2019. Was grading problem number 2 here, the rainy research problem, and just hated me, because no one did it right. All right. So, let's try to solve these problems.
So problem 1 is about restaurant-- restaurant, yes, OK. All right. So basically, what's happening? A popular restaurant, Criminal Seafood. What's the reference? Legal Seafood. Yes, opposite. Does not take reservations, but maintains a wait list where customers who have been on the wait list longer are seated earlier. Sometimes customers decide to eat somewhere else, so restaurant must remove them from the wait list. OK.
Assume that a customer has a different name. No two customers are added to the wait list at the exact same time. So there's a kind of an ordering at which people are being added to this wait list. Does that makes sense? Design a database to help Criminals Seafood maintain its wait lists supporting the following operations, each in constant time.
OK, so here, we've refactored the running time up to the top. And it's a-- oh, sorry. I added the build here. I guess that's still constant time. That's fine, OK.
AUDIENCE: I have time d--
JASON KU: Yeah. State whether each running time operation is worst case, amortized, expected. So, when you see that statement, you're saying, OK, I'm allowed to use a hash table if I want to. I just have to make sure if I use one, I label my operations expected and amortized when they occur.
Which operations are expected? Basically all of them. Which operations are amortized? The ones that changed what's in the data structure, insert, delete. OK. So we've got some operations, building an empty thing. Adding a name-- so this name is x to the back of the wait list. What do I know about x? What do I know about the names per our assumptions in this class?
AUDIENCE: They're unique, so they can be a key.
JASON KU: They're unique, so they can be a key. And they fit in a constant number of words by our assumption. So I can compare two of them in-- or I can hash one in constant time. That's kind of the assumption we make on these inputs, they're strings. And I didn't assign you a bound on their length, so it's probably not something you need to worry about.
OK remove a name. So already I'm feeling like, I need to be able to find things by their name. OK. And then seat the next person in line. Does that makes sense? So what kinds of things do I need to maintain here? I have people. They have names and they have kind of places, the time that they came in. But am I given the times? No. I'm not given times anywhere. It's not on the inputs to my operations. So it's not like I'm going to be able to key on time. That makes sense?
What's the important part about the times? The order.
AUDIENCE: And you're given that.
JASON KU: Yeah. Basically whenever-- I'm basically trying to maintain a sequence on these guys. There's a front one and a back one and people in the middle. And I want to make sure that order stays the same, or else people are going to getting angry at me, because ah, they came here after me and I got a-- yeah. You've been in that situation. OK.
So, we're trying to maintain some kind of sequence, an extrinsic order on these things. But we're also needing to be able to look up people by their name because I want to be able to change this thing. Does this maybe sound familiar to some other problem we had on this term's problem sets? Yeah, I think there was a problem where--
AUDIENCE: The chat.
JASON KU: The chat. You had to store a sequence. But you also had this dictionary that you had to look things up. Now the nice thing about that situation is that the things that you needed to look up on was static. And so what could I use for my dictionary, my sets data structure for that? Anyone remember? You could just use a sorted array. Because it's static, these things aren't updating all the time, and so it was fine for me to just use a static array.
Here-- and we gave you lookup times that were worst case logarithmic. Here, I'm asking for constant time. Sorted array is not going to cut it, set AVL's not going to cut it, so what am I going to use?
JASON KU: A dynamic array or a hash table. Now dynamic array might not be great because I don't actually have a numeric bound on how big these keys are, I just know that they fit in words. So I can't actually make a direct access array because those words-- while they fit in a constant number of words, I don't know if the integer representation of those are polynomial-bounded. Does that make sense? So I do want to use hashing in this case.
And so what do I want to maintain? I want to maintain a sequence data structure on customers. Customers. Is there a U in there somewhere? No. This is right, right? OK. And a set mapping. So this is usually how I do it. Like I want to say a set keyed on something. If it's just a set, then I just have it key. I can look up whether that thing is there or not.
But when I'm actually having it mapped to something else, I'll say, mapping a key, space, to something else, usually the item that I'm storing or maybe some property of the items that I'm storing. Does that make sense? OK. So mapping, what do I map here? Names. To-- oh. To what? Do I want to map?
AUDIENCE: To time--
JASON KU: The time. The time that they entered. Do I want to do that?
AUDIENCE: Well, you can't exactly do it with the time that they entered, but the sequence on the customers will show what's next in line.
JASON KU: Yeah. I want to store it to where it is in this sequence. So I'll just store the index where it is in the sequence and I can just look it up. That's sounds-- yeah? Is that what--
AUDIENCE: I thought the indices change.
JASON KU: Yeah. Indices change every time I add or remove things.
JASON KU: Ah, yeah. Store a pointer. Pointer to place in sequence. Now, that's a little weird to say, because I haven't told you how I represented this sequence. But conceptually I can say I'm drawing a pointer to someplace in this that represents where it is, I'll deal with the details of that later.
But generally I call this a linked data structure, because I'm linking between two data structures so that I can do a query in one, say, and find out where it is and the other. Or do a query in the other, that kind of thing. So, everyone kind of understand why I chose these things? How I approached this problem? Yeah?
AUDIENCE: I'm not sure how a pointer fixes the problem of just storing an index. The place that a person is at in the sequence is going to change over time.
JASON KU: Mm-hmm.
AUDIENCE: --assuming that a pointer updates, would that change?
JASON KU: So-- OK. So let's say if every time-- so let's say I'm storing-- let's say I'm storing this sequence in a linked list, let's just say, because I know that I'm going to have to be inserting and deleting things from the middle of this thing. Does that make sense? All right. So I'm going to go ahead and say that this is going to be a linked list. Now I have this linked list.
So if I said, all right, the set data structure in my set stores-- or stores where it is in the linked list, let's say it's stored at k, all right? Cool. Now if I seat this guy, I stick him at his table, every index has changed. So for me to update the indices stored in this set data structure, I have to change every single one of them.
JASON KU: Does that make sense?
AUDIENCE: Yeah, no, this makes way more sense.
JASON KU: Yeah. So really what I want, not to store a number here, but an actual pointer to the node that's containing this thing, because the node isn't changing unless that thing left my data structure. The node, the address of this thing in memory, the node is just a little container that contains an item and the next pointer, and actually we're going to need a previous pointer here. We're going to need a doubly-linked list.
Because what we're going to do when we remove something, we need to stitch it back together in constant time, which means we need to know what's the one in front of us and the one behind us. Yeah? No? All right. So linked list here. We already said maybe using a hash table here.
And so, OK, great. This is basically enough for me to say, this is my data structure, these are the invariants. What are the invariants? It's storing all the customers. That's an invariant. I mean, it's not a very strong invariant, but it's-- yeah, I mean, I should say that. I'm storing all the customers because I'm going to have to make sure that I'm maintaining that when I do a query.
And then here, set mapping names to pointers. The place in this thing. And as I remove things, I need to make sure that invariant stays the same. It's still mapping all the names of customers that I have in my thing to their nodes. So when something leaves, I better need to make sure that both of these things are updated. Does that make sense?
OK. So how do I maintain these operations? We've got a build-- build just sets up empty of these things. It's easier to say that. I build an empty linked list, doubly-linked lists on the customers, and I build a set-- a hash table mapping nothing to nothing right now. OK. So that's build.
I'm going to be precise here and actually write down the thing because I told you not to ignore a operation, or else we can't give you points for it. The next one-- yeah?
AUDIENCE: By emptying-- by building an empty linked list--
JASON KU: Mm-hmm.
AUDIENCE: --it just has like the head and the tail, it's actually none, but, like, it exists?
JASON KU: Yeah, right. It's a thing in memory that we store. It has a pointer to a head and a tail. Those are none right now. But we will add things to it. OK. So the second one, add name. What do we have to say? We have to update this data structure so it maintains it's a variance. I have to-- usually I start with one of them. I get to a point where, oh, I really should have updated the other one first. So sometimes it's hit or miss.
What do I want to do here? Well, I have no idea where in this sequence this is. So I kind of have to go here first. Oh, sorry. Adding a guy, I could do either way. I know-- where is he going to go? End of the list. So I just stick it there. So add x to end of sequence.
So I come to a trick that I like. Sequence, set. OK. In this instance I only have one sequence and one set. And so calling them sequence and set, probably fine. I'm not going to confuse the greater, I'm not going to confuse myself. But when I have more than one of these things, or even for brevity on an exam where I'm time-constrained, give these things a name. Say this is a sequence C and M. I don't know. I see customer here, I see a map here, so maybe. I don't know. But just give them a letter, then we can follow you much more clearly and you can refer to these things more precisely. So end of C.
Then what do we have to do? I fix this guy, this guy's good. Now I have to fix this guy. So I add x to the set and I map it back to that node that I just came from. I could store it in a temporary variable. Add x to M pointing to n node v to v. OK. Cool?
So, I added my x to the end of C into a node v. I'm kind of-- I labeled it so I can reference it later. And in code, I would probably remember what that node was. Add x to M pointing to that node, and now I've maintained my invariant. Great.
AUDIENCE: Maybe [INAUDIBLE].
JASON KU: Add key x. Or add x to M having the key of x point to that node. Does that make sense? There's a subtlety there. 3 3, we have remove-- I don't remember the name, whatever. OK. So here, we have to do-- order matters. I don't know where it is in the sequence, but I'm going to look it up in the set, and I'm going to remove it from the set, and I'm going to look at whatever it points to and remove it from the sequence. I'm running out of time a little bit here, so I'm not going to write all that down. You guys understand what that means.
Here, we're using the fact that it's a doubly-linked list so that we can relink things together. You don't have to tell me the three points-- the previous pointer of my next thing to the next-- the-- right? You don't have to tell me how to relink those pointers because we did that already in problem set 1 or whatever.
AUDIENCE: Just in general, as you're in a situation for where you're using not a doubly-linked list--
JASON KU: Mm-hmm.
AUDIENCE: where just a linked list would be better than a doubly linked list because it seems like it always solves problems.
JASON KU: Yeah. Doubly-linked lists is almost always strictly better than a singly-linked in theory problems. So yeah, use it. OK. And then seat, last one. That's just take the front of the sequence, remove it, change the head around, but I get that you can delete first. You're reducing to the interface that we had. You're deleting first on the sequence.
But now, we have a situation. I deleted the first guy. How do I know who the first guy is? Well, I'm storing its name in there. I'm storing the names of these customers. So I know who is at the front, I look in this set data structure, and I remove that entry. Does that makes sense? Because now, I no longer need to maintain where he goes.
Now in actuality, I could just not update this set data structure. But if I do that, then, well, my running times are still linear time. I'm not giving you a bound on space. I'm still constant time, sorry. So you don't actually have to do that removal, but if that customer comes back and wants to get on the wait list again, there are things to consider.
OK. So that's that question. Next question, last 10 minutes or so. Raniy Research. This is a problem that people had nightmares about. OK, so basically we've got a Meather Wan. He's a weatherman. A scientist who studies global rainfall. And he's got a bunch of sensors everywhere. And each one can post to the cloud or something a measurement that's of the form a triple of integers r, l, and t where r is a positive amount of rainfall, an integer; a latitude, an integer again; and at a time. We got three things to deal with here. Yuck.
But they're all integers. And don't be like, oh, well Jason, latitudes are pretty small. So I can assume these integers are small and these things take constant time. I'm not specifying to you a resolution at which I'm measuring these integers. And I haven't given you a bound between what that resolution is compared to the number of measurements that I have, so I don't play those games. OK.
The peak rainfall at a given latitude since a particular time is the maximum rainfall at any measurement at that latitude measured at a time greater than or equal to that time. Does that make sense? Or 0 if there aren't measurements at that latitude. OK. To score after the time-- or before the time or whatever.
Describe a database that we can build it in constant time. Is an empty one-- I added this one because we weren't good about that last spring. Record data. We give you a triplet. And then-- so record data, for it to be correct, I just have to maintain that information. For these kind of updates, I don't-- it's really hard for me to argue that this thing is correct. Because I just-- I throw it at the database, the database doesn't have to give anything back to me.
So the important thing here about correctness is that peak rainfall gives it to me and it gives it to me in the time bound that I'm looking for. And peak rainfall is returning the peak rainfall at a particular latitude since t. So we have three things. Yep?
AUDIENCE: Given that you never have to return a single measurement, is that [INAUDIBLE] have a record of it?
JASON KU: There's the potential that you don't need to store all of the information because all we're doing is giving you back R's, essentially. It's possible that you don't need to store the latitudes or times at all. You don't even need to store the triplets. Now in reality, I'm querying on the latitudes and the times. So I should store them somewhere, but I might be able to compress them.
In particular, many things could be stored at the same latitude. That's kind of the whole point of the query. And so we want-- I mean, we may only need to store that latitude once. Does that makes sense? OK. I'm going to wait for questions until after because I want to get to a solution to this problem.
All right. So what do we need to do? We need to be able to add things. And I want to return. So return, I'm going to have to query something, and then I'm going to return something. So return peak rainfall at latitude l since time t. What do I care about in a particular query? I only care about all the things at l, at a latitude l.
So really, this isn't such an interesting thing, but I want to be able to have maybe many data structures, one associated with each L. Does that make sense? And how can I find one in each L quickly? Put it in a dictionary. What's my time bound? Worst case log n. So what data structure do I use for that set data structure? A set AVL.
So you're going to first have a set AVL-- say, L-- mapping latitudes to-- well, now we have more data structures. I want to store a lot of the things that have the same latitudes in another data structure. The ones probably storing the times in the rainfalls of all those measurements. Yeah? Yeah.
AUDIENCE: A hash table.
JASON KU: A hash table. OK, so what kinds of query am I going to want to do on the things at the same latitude?
AUDIENCE: You're going to want to get the times.
JASON KU: I'm going to want to get the times, but more than that, I'm doing kind of an ordered query. I need the things less than a certain time. Greater, sorry.
AUDIENCE: So it's really--
JASON KU: Just a second.
AUDIENCE: Could you do like an AVL for the time and an AVL for the rain?
JASON KU: OK. Do I care about an AVL for the rain? I mean, looking up on the rainfall.
JASON KU: No. So I'm going to go ahead and store these things in a time-sorted AVL to mapping latitude l to-- I'm going to call this data structure a time data structure. I'm going to say it's t of l. That kind of looks like a recurrence, so it's a little irking me right now, but I don't have anything better.
All right. So now each one of these time data structures is a set AVL mapping time to the rainfall measurement. All right. So that's going to-- if my query was, return the peak rainfall-- sorry. Return the rainfall of the thing with latitude l and time t, we'd be done, kind of. You'd know how to support that query. To insert things, I insert things into both data structures, and I just look it up.
The one complication here is that I'm not asking for what is the rainfall at a particular time. I want to know what the maximum rainfall is up to this time. OK. So max heap's good if I want to know the global max. But here, I want to know the max bounded by a certain range. So we're going to-- you can ask me questions after this, we're running out of time a little bit. So anyone have an idea for how-- yeah?
AUDIENCE: Could you just augment the AVL with the max?
JASON KU: OK.
AUDIENCE: And you can just look at the right child and then just look at the max at that point?
JASON KU: Ah, OK. So what your colleague is saying, if we augment by the max r in my subtree, maybe we can use that to figure out this query. Because we're ordered on t, right we have this nice monotonic property that everything that's going to be in my query-- everything to the right of a certain time-- if my time is above t at a particular node, in everything in the right subtree is also above that t because of the order of my set's data structure, because I'm ordered on times.
So there's maybe the possibility that if I look to my right subtree, I can not do work all over here by just looking at the max in that subtree. so That's an idea of, let's say, augmented by subtree max r. You probably want to give this a name as well. So like v max where v is a node in my thing.
And I want to show how to support this, how I can compute that from its children. So how do I actually support this query, I can think of it recursively. I have a couple of cases. If I'm at a v here, I want to define a recursive function that's called peak rainfall of a given node lower bounded by a t.
So if I'm here, there's two cases. Either my t, my time is bigger or smaller-- is in my range or out of my range. If it's out of my range, what do I do? It's lower than my time bound. I can just recursively call this function on this node. Because I know that anything is going to be down here that I care about.
And that's just one recursive call down the tree. And so if I only limit myself to one recursive call down the tree, I'm always going down each time. This is going to take logarithmic time. So that's the first case, that's the easy case. This thing is not in range, I return recursively the thing to my right.
What's the other case? I'm in my range. Well now I could return, recursively call both sides. Because that's what this peak is talking about. What's my peak rainfall? But if I do that, if I call it here recursively and I call it here recursively, that could take linear time. I might touch every node in my tree.
So why did I do this augmentation? So I don't have to do work on this node. I just return the maximum rainfall in this subtree, and then recurse on this side. So I did constant work on this side, I did one recursive call down here which you could go to the bottom, but that's OK, I can afford to go to the bottom of the tree. Does that makes sense?
And if I don't have either subtree, then I'm done. If at any time. I don't have the node that I'm supposed to recurse on, I take the max of this subtree, myself, and whatever the recursive return value is here. And comparing three values, returning their max. Does that make sense? OK. So that's what we call a one-sided range query. So in the-- I think problem session 4 that I didn't get to, it shows you a way to do this for a two-sided range query where I need to know the maximum of all things between two things.
But it's really no more difficult than this to finding a recursive function that uses an augmentation so that you don't have to do recursive call on both sides. Does that makes sense? OK. That's going to be it for now, and I can take questions after. Thanks for coming.