# Lecture 21: Cryptography: Hash Functions

Flash and JavaScript are required for this feature.

Description: In this lecture, Professor Devadas covers the basics of cryptography, including desirable properties of cryptographic functions, and their applications to security.

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

SRINIVAS DEVADAS: All right, let's get started. Good morning everyone. I see a lot of tired faces. I'm not tired. Why are you tired?

[LAUGHTER]

I only lecture half the time. You guys take the class all the time. So today's lecture is about hash functions. And you may think that you know a lot about hash functions, and you probably do.

But what we're going to do today is talk about really a completely different application of hash functions, and a new set of properties that we're going to require of hash functions that I'll elaborate on. And we're going to see a bunch of different applications to things like password protection, checking the integrity of files, auctions, and so on and so forth. So a little bit of a different lecture.

Both today and on Thursday I'm going to be going to be doing cryptography and applications, not too much of algorithms. But we will do a little bit of analysis with respect to whether properties are satisfied, in this case by hash functions or not. So let's just dive right in.

You all know what hash functions are. There's no real change in the definition. But the kinds of hash functions that we're going to be looking at today are quite different from the simple hash functions, like taking a mod with a prime number that we've looked at in the past. And the notion of collisions is going to come up again, except that again we're going to raise the stakes a little bit.

So a hash function maps arbitrary strings-- let me do this right. So you're not making a statement about the length of the string. You will break it up, even if you had a string of length 512, or maybe it was 27, you do want to get a number out of it. In a specific range there's going to be a number of bits associated with our hash functions. And previously we had a number of slots associated with the output of the hash function. But the input could be arbitrary.

And these arbitrary strings of data are going to get mapped, as I just said, to a fixed length output. And we're going to think about this fixed length as being a number of bits today, as opposed to slots in the hash table. Because we really aren't going to be storing a dictionary or a hash table in the applications we're going to look at today. It's simply a question of computing a hash.

And because the fixed length output is going to be something on the order of 160-bits, or 256-bits, there's no way that you could store two arrays to 160 elements in a hash table, or even two arrays to 64 really. And so we're going to just assume that we're computing these hashes and using them for certain applications.

I just wrote output twice I guess. So map it to a fixed length output. We want to do this in a deterministic fashion.

So once we've computed the hash of a particular arbitrary string that is given to us, we want to be able to repeat that process to get the same hash every time. We want to do this in a public fashion. So everything is public. There's no secrecy.

There's keyed hash functions that we won't actually look at today, but maybe in passing I'll mention it next time. We're not looking at keyed hash functions here. There's no secrets in any of the descriptions of algorithms or techniques I'm going to be describing today.

And we want this to be random. We want it to look random. True randomness is going to be impossible to achieve, given our other constraints. But we're going to try and approximate it with pseudo-randomness. But we'd want it to look random, because we are interested-- as we were in the case of dictionaries and the regular application of hash functions-- we are interested in minimizing collisions.

And in fact we're going to raise the stakes really high with respect to collisions. We want it to be impossible for you, or anyone else, to discover collisions. And that's going to be an important property of collision resistance that obviously is going to require randomness. And those are the three things we want, deterministic, public, and random.

And so just from a function description standpoint you have 0, 1 star here, which implies that it's an arbitrary length strength. And we want to go to 0, 1 d. And this is a string of length d. So that means that you're getting d-bits out from your hash function. And here the length is greater than or equal to 0.

So that's it. Not a lot that's new here. But a few things that are going to be a little bit different. And there's some subtleties here that we'll get to.

I want to emphasize two things, one of which I just said. There's no secrecy, no secret keys here in the hash functions that we are describing. All operations are public. So just like you had your hash function, which was k mod p, and p was a prime and p was public and known to everyone who used the dictionary, everything here we are going to be talking about is public. So anyone can compute h.

And we're going to assume that this is poly-time computation-- not too surprising-- but I'm being quite flexible here. When you look at dictionaries, and you think about using dictionaries, and using it to implement efficient algorithms, what is the assumption we kind of implicitly made-- are perhaps explicitly in some cases-- with respect to computing the hash? Anybody? Yeah?

AUDIENCE: Constant time?

SRINIVAS DEVADAS: Constant time. We assumed-- so this is not necessarily order 1, right? So that's important. So we're going to-- I want to make sure you're watching. So you're going to raise the stakes even with respect to the complexity of the hash. And as you'll see, because of the desirable properties, we're going to have to do that. We're going to ask for really a lot with respect to these hash functions.

Nobody can find a collision, right? And if you have something as simple as k mod p, it's going to be trivial to find a collision. And so these order 1 hash functions that you're familiar with aren't going to make the grade with respect to any of the properties that we'll discuss in a few minutes. All right, so remember this is poly-time computation.

And there's lots of examples of these hash functions. And for those of you who are kind of into computer security and cryptography already, you might have heard of examples like MD4 and MD5. These are versions. MD stands for message digest. These were functions that were invented by Professor Rivest.

And they had d equals 128 way back when-- 1992, if I recall-- when they were proposed. And these algorithms have since been broken in the sense that it was conjectured that they had particular properties of collision resistance that it would take exponential time for anybody to find collisions. And it still kind of takes exponential time, but 2 raised to 37 is exponential at one level, but constant in another level. So you can kind of do it in a few seconds now.

So a little bit of history. I'm not going to spend a lot of time on this. MD5 was used to create what was called a secure hash algorithm. This is 160-bits. And this is not quite broken at this point. But that people consider it broken, or soon to be broken.

Right now the recommended algorithm is called SHA-3, secure hash algorithm version three. And there was a contest that ran for like 18 months, or maybe even longer, that eventually was won by what turned into the SHA-3. And they had a different name for it that I can't recall. But it turned into SHA-3.

And what happened along the way, as we went from MD4, MD5, SHA-1 to SHA-3, is that this amount of computation that you had to do increased. And the complexity of operations that you had to do in order to compute the hash of an arbitrary string increased to the point where-- you want to think about this as 100 rounds of computation. And certainly order d computation, where d is the number of bits. And perhaps even more. So it's definitely not order 1.

So as I said a little bit of context with respect to the things that are out there. At the end of the lecture I'll give you a sense for how these hash functions are built. We're not going to spend a lot of time on creating these hash functions. It's really a research topic onto itself and not really in the slope of 6.046. What is in the scope of 6.046, and what I think is more interesting, which is what we'll focus our energy and time on, is the properties of these hash functions. And why these properties are useful in a bunch of different apps.

And so what is it that we want? We want a random oracle. We want to essentially build something that looks like that, deterministic, public, random. And we're going to claim that what we want is this random oracle which has all of these wonderful properties that I'm going to describe. I'm going to describe the random oracle to you, and then I'm going to tell you about what the properties are.

And then unfortunately this is an ideal world and we can't build this in the real world. And so we're going to have to approximate it. And that's where the MD4's and the MD5's and the SHA-1's came in, OK? So this is not achievable in practice.

So what is this oracle? This oracle is on input x, belonging to 0,1 star. So that could be an arbitrary string. If x not in the book-- so there's this the book, all right? And there's this infinite capacity book that has all of the computations that were ever done prior. And they're always stored in the book.

And that's how we're going to get determinism. Because this book initially gets filled in. All of the entries in the book are filled in using pure randomness. So you flip a coin d times to determine h of x.

So that's basically it. And you just keep flipping. You have to flip d times. And so if x was 0, you flip d times, d was 160. You flipped a coin 160 times and got a string. If x were 1, flip 160 times, you get a different string with very high probability, obviously. And so on and so forth.

But what you do is you have this book. So you're going to record x h of x in the book, OK? So at some level your hash function is this giant look-up table in the sky, right? Actually not giant, infinite capacity look-up table in the sky. Because you can put arbitrary strings into this.

And if it's in the book-- this is obviously the important part that gives you determinism-- then you return y, where x and y are in the book, OK? So you get a random answer every time, except as required for consistency with previous answers. So the very first time you see a string, or-- and the whole world can create this book. It's public.

So if I created the book at first with a particular string, let's say Eric. I was the string. And I'm the one who put the entry-- x equals Eric, and h of x, h of Eric equals some random 160-bit string-- into the book, I get credit for that, right? But if you come a nanosecond later and ask for h of Eric, you should get exactly what got put into the book when I asked for h of Eric. And so on and so forth.

So this is true for everybody. So this is like-- I mean basically impossible to get. Because not only can anybody and everybody query, you have to have this ordering associated with people querying the book. And you have to have consistency. All right. So everyone convinced that we can't build this? All right.

If you took anything out of this lecture, that's what you should take. No, no. There's a lot more.

So we want to approximate the random oracle. And we're going to get to that. Obviously we're going to have to do this in poly-space as well.

So what's wrong with this? Of course this picture is I didn't actually say this, but you'd like things to be poly-time in terms of space. You don't want to store an infinite number-- this is worse than poly-time, worse than exponential time, because it's arbitrary strings that we're talking about here, right? So you can't possibly do that. So we have to do something better.

But before I get into how we'd actually build this, and give you a sense of how SHA-1 and MD5 were built-- and that's going to come a little bit later-- I want to spend a lot of time on the what is interesting, which are the desirable properties. Which you can kind of see using the random oracle. So what is cool about the random oracle is that it's a simple algorithm. You can understand it. You can't implement it. But now you can see what wonderful properties it gives you. And these properties are going to be important for our applications, OK?

And so let's get started with a bunch of different properties. And these are all properties that are going to be useful for verification or computer security applications. The first one, it's not ow, it's O, W. It's one-wayness, all right? So one-way, or one-wayness. And it's also called-- you're not going to call it this-- but perhaps this is a more technical term, a more precise term, pre-image resistance.

And so what does this mean? Well this is a very strong requirement. I mean a couple of other ones are also going to be perhaps stronger. But this is a pretty strong requirement which says it's infeasible, given y, which is in the-- it's basically a d-bit vector, to find any x such that h of x equals y. And so this is x is the pre-image of y.

So what does this say? It says that I want to create a hash function such that if I give you a specific-- we call it a 160-bit string, because we're talking SHA-1 here, and that's the hash-- I'm going to have, it's going to have to be impossible for me to discover an x that produced that 160-bit string, OK? Now if you go look at our random oracle, you realize that if you had a 160-bit string, and perhaps you have the entire book and you can read the entire book.

It's an infinite capacity book. It's got a bunch of stuff in it. And know that any time anyone queried the book the first time for a given x, that there was this random 160-bit number that was generated and put into the book. And there's a whole lot of these numbers, right?

So what's going to happen is, you're going to have to look through the entire book, this entire potentially infinite capacity book, in order to figure out if this particular y is in the book or not. And that's going to take a long time to do, potentially, OK? So in the case where you have a random oracle you'd have to go through and find-- looking at the output hash corresponding to each of the entries in the random oracle, you're going to start matching, match, match, match, match, it's going to take you exponential time. Well actually worse than that, given the infinite capacity of the book.

So this clearly gives you that. Now you may not be a completely satisfied with that answer because you say well, you can't implement that. But we'll talk a little bit, as I said, about how you could actually get this. But what's-- I should be clear-- is that the simple hash functions that we've looked at in the past just to build dictionaries do not satisfy this, right?

So suppose I had h of x equals x square mod p. Is this one-way, given a public p? No of course not, right? Because I'm going to be-- it's going to be easy for me to do something. Even though this is discrete arithmetic I could do something like, well, I know that what I have here-- actually let's do it with something that's simpler, and then I'll talk about the x squared.

If I had something as simple as x mod p, I mean that's trivially broken in terms of one-wayness. Because I know that h of x could be viewed as the remainder. So anything-- if this is h of x, and let's just call that y for a second, because that's what we had it out there. Something that's a multiple of y plus the remainder-- so I could have a-- is that right? Is that what I want? Yeah.

No, plus y. So I want a of-- well since I can't figure it out, why can't you? What do I need to put in there in order to discover an x that would produce a y? Can I write an equation? Yeah?

AUDIENCE: Could you just write y itself?

SRINIVAS DEVADAS: Just y itself. That's right. Good point. Just y itself in this case. Good. I knew you guys were smarter than me. This proves it.

So if you just take y-- and y remember is going to be something that's 0 to p minus 1, right? And that's it. It just goes through, right? So that's a trivial example, right?

Now if I put x squared in here, obviously it's not y, but I could start looking at-- what I have here is I'm going to get y that looks like x squared. But I could take the y that I have, take the square root of that, and then start looking for x's that give me the y that I have. Actually it's not a complicated process to try and figure out, through trial and error potentially, what an x is that produces a particular y for the kinds of hash functions that we've looked at, all right?

Now as you complicate this equation it gets harder. Because you have to invert this set of equations. And that's what the game is going to be when you go create one-way hash functions. The amount of computation that you do in order to compute the y is going to increase to the point where, as I mentioned, you have 80, 100 rounds of computation, things getting mixed in.

And the hope is that you create this circuit, if you will, that has all this computation in that. Going forwards is easy, because you've specified the multiplications and the mods and so on and so forth. But not all of these operations have simple inverses. And going backwards, which is what you need to do in order to break one-wayness, or discover the x given a y, is going to be harder and harder as the computations get more complex, OK?

So everyone have a sense of what one-wayness is? So that's one-wayness. There's four other properties, two of which are very related.

CR and TCR. So CR is collision resistance. It's infeasible to find x and x prime such that x not equal to x prime, and h of x equals h of x prime, which is of course a collision. OK? And that just says you have this crazy hash function where you can't discover collisions. Well it would be absolutely wonderful. In fact that's what we wanted when we built dictionaries.

But why don't we use SHA-3 in dictionaries? Why don't we use SHA-3 in dictionaries? Yeah?

AUDIENCE: Because it's more complicated than we need.

SRINIVAS DEVADAS: Yeah, it's horribly slow, right? It would take longer to compute the hash than access the dictionary, when you actually had a reasonable dictionary that maybe had some collisions. I mean you just go off and you have a linked list, you can afford a few collisions, what's the big deal, right? So it just doesn't make any sense to use this level of heavyweight hash function, even if it satisfies collision resistance-- which some of these are conjectured to do-- for the applications we've looked at.

But there'll be other apps where collision resistance is going to be important. So that's collision resistance. And then there's-- TCR is target collision resistance. It's a weaker form-- so sometimes people CR strong collision resistance, and TCR weak occlusion resistance. We'll use CR and TCR here.

And this says it's infeasible, given x-- so there's a specific x that you want to find a collision for, as opposed to just finding a pair that goes once to x and x prime. And any pair would suffice to break the collision resistance property. But TCR says is I'm going to give you a specific x. And I want you to find an x prime who's hash collides with the hash of x, OK? That's TCR.

OK that's TCR for you. And that just to be clear, I think you probably all got this, obviously we want this here because we have a deterministic hash function. And it's a trivial thing to say that if you had x, and you had x again, that you get the same hash back from it. That's a requirement, really. So we want two distinct x and x primes that are not equal that end up colliding. That's really what a collision is.

And so you see the difference between CR and TCR? Yup? Yeah?

AUDIENCE: Are we to assume that given an x it's very easy to get the h of x back?

SRINIVAS DEVADAS: So the question was, given an x, it's poly-time computation to get h of x. Absolutely. Public poly-time computation given an x to get h of x. So going this way is easy. Going this way-- I ran out of room-- hard. OK?

AUDIENCE: So does that mean that TCR is basically the same as 1?

SRINIVAS DEVADAS: No, no, no, absolutely not. TCR says it's OK. You can compute this. You can get x. And you can get h of x. So given x, you know that you can get h of x.

I didn't actually put that in the definition. And maybe I should have. So given x you can always get h of x. Remember that. It's easy to get h of x. So any time I say given x, you can always add it, saying given x and h of x.

So I'm given x. I'm given h of x. I obviously need to map-- I need to discover an x prime such that h of x prime equals h of x, OK? Now you have situations where for-- it may be the case that for particular x's you can actually do this. And that's enough to break TCR. So you have to have this strong property that you really don't want to find collisions are for some-- even if there's a constant fraction of x's that break the TCR property, you don't like your hash function, OK? Because you might end up picking those and go build security applications using those properties.

I want to talk a little bit about the relationship between OW, CR, and TCR. So I'm going to get back to that. And we're going to talking about hash functions that satisfy one property but don't satisfy the other. And I think your question will probably be answered better, OK? Thanks for the question.

So those are the main ones. And really quickly, if you want to spend a lot of time on this-- but I do want to put up-- I think I'll leave these properties up here for the duration. Because it's important for you to see these definitions as we look at the applications where we require these properties, or a subset of these properties.

But that we have pseudo randomness. And this is simply a function of the fact that-- so this is PRF-- we know we can't build a random oracle. And so we're going to have to do something that's pseudo-random. And basically what we're saying here is the behavior is indistinguishable from random.

So we're going to have to use non-linearity, things that are called non-linear feedback shift registers, to create pseudo-random functions. There's many ways that we can create pseudo-random functions. We won't really get into that. But obviously that's what we want.

And then the last one is a bit tricky. And we will have an app that requires this way at the end. But this is infeasible given h of x to produce h of x prime, where x and x prime are-- and it gets a little bit fuzzy here-- are related in some fashion, right? So a concrete example of this is, let's say that x prime is x plus 1. So this is a reasonable example of this.

So what this says is you're just given h of x. It doesn't actually say anything about one-wayness yet. But you could assume, for example, that if this was a one-way hash function, that it would be possible to get x from h of x, correct? And let's keep that though. Hold that thought, all right? We're going to get back to it.

So if I'm just given the hash through some computation, it may be possible for me to create another hash, h of x prime, such that there's some relationship that I can prove or argue for between the strings that created the hashes, namely x and x prime, OK? That's what malleability is, right? Now you might just go off and say here's an x, here's a y, here's h of x, and here's h of y. These look completely random. And you might go off-- I'm being facetious here-- I say that y is x's third cousin's roommate's brother-in-law or something, right? I mean just make something up, right?

So clearly there's got to be a strong, precise relationship between x and y. If in fact you could do this and get y equals x plus 1, that'd be a problem, right? But if you are-- and then you can do this sort of consistently for different x's and y's, that would absolutely be a problem, right? But what you're really asking for-- and typically when you want non-malleability-- it's things where you have auctions, for example, where you are to be careful about making sure that you don't want to expose your bid. And so maybe what you're doing is exposing h of x.

You don't want somebody to look at your h of x and figure out how they could beat your bid by just a little bit. Or in case of Vickrey auctions, where the second highest bidder wins, now just be a little bit below you, right? So that's the kind of thing that you want to think about when it comes to non-malleability, or malleability, where you want a strong relationship between two strings that are related in some ordered fashion, like x equals-- x prime equals x plus 1, or just x prime equals 2 times x.

And you don't want to be able to-- you don't want the adversary to be able to discover these new strings. Because that would be the system, all right? So any questions about properties? Are we all good on these properties? All right, because I'm going to start asking you how to use them for particular applications, or what properties are required for certain applications, OK?

One last thing before we get there. I promised a slightly more detailed analysis of the relationships between these properties. So let's do that. Now if your just look at it, eyeball it, and you look at collision resistance and TCR, what can I say about the relationship between CR and TCR? If h is CR, it's going to be TCR, right? It's got to be TCR. It's a strictly stronger requirement.

But not reverse. And you can actually give a concrete example of a particular hash function that is TCR. I'm not going to go there. It's actually a little more involved than you might think it is, where a TCR hash function is not collision resistant. But you can see that examples such as these should exist, simply because I have a more stringent property corresponding to collision resistance as opposed to TCR, right?

So if you're interested in that particular example, you're not responsible for it, get in touch with me and I'll point you to a, like a three-page description of an example. So I didn't really want to go in there. But what I do want to do is talk about one-wayness and collision resistance. Because I think that's actually much more interesting, all right?

So if h is one-way-- any conjectures as to what the question mark is in the middle? Can I make strong statements about the collision resistance of a hash function, if I'm guaranteed that the hash function I have is a one-way hash function, or vice versa? Another way of putting it is, can you give me an example of, just to start with, a hash function which is one-way but not TCR, not target collision resistant?

So I'm going to try and extract this out of you. This is somewhat subtle. But the way you want to think about this is, let's say that h of x is OW and TCR, OK? And so I have a bunch of inputs. And this is the output. And I get d-bits out. And I've got x1, x2, to xn, OK?

Now I've given this h-- I've been given this h which is one-way and TCR. It satisfies those properties that you have up there. In the case of one-way, I give you an arbitrary d-bit string. You can't go backwards and find a bunch of the xi's that produce exactly that d-bit string, all right? So it's going to be hard to get here.

But you're allowed now to give me an example. So this is some hash function that you can create, which may use h as well. And h is kind of nice because it has this one-way property. So let's say that we want to discover something where one-way does not imply TCR. So I want to cook up a hash function h prime such that h prime is one-way, but it's not TCR, OK?

The way you want to think about this is you want to add to h. And you want to add something to h such that it's still hard-- if you add h it's still hard to go from here to there. Because you've got to go deeper. If you add to, for example, the inputs of h. Or you could add to the outputs of h as well, or the outputs of the current h.

But you can basically go deeper, or need to go deeper in order to find the break one-wayness, in order to find an x, whatever you have, that produces the d-bit string that you have, right? So what's a simple way of creating an h prime such that it's going to be pretty easy to find targeted collisions even, not necessarily collisions, it's pretty easy to find targeted collisions, without breaking the one-way property of h? Yeah?

AUDIENCE: So if you have x sub i, if i odd then return h of x of i. So that's minus 1. So return the even group.

AUDIENCE: Given x any x of i, you can usually find another x of i that was the same output? You can go backwards.

SRINIVAS DEVADAS: You can't go backwards. Yeah, that's good. That's good. I'm going to do something that's almost exactly what you said. But I'm going to draw it pictorially.

And what you can do, you can do a parity, like odd and even that was just described. And all I'll do is add a little [? XNOR ?] gate, which is a parity gate, to one of the inputs. So you have and b here. So I've taken x1, and I have a and b here. So I've added-- I can add as many inputs as I want to this function.

Oh I should mention by the way, h of x is working on arbitrary strings. And obviously I put in some number here that corresponds to n, which is a fixed number. So you might ask, what the heck happened here with respect to arbitrary strings? And there's two answers.

The first answer is, well, ignore arbitrary. And assume that you only have n-bit strings. And n this is really large number, right? And that may not be particularly satisfying.

The other answer is, which is more practical, which is what's used in practice, is that typically what happens is, you do have particular implementations of hash functions that obviously need to have fixed inputs, n, for example. And n is typically 512. It's usually the block size. And you chunk the input up into five 12-bit blocks.

And typically what you do is, you take the first five 12-bits, compute the hash for it. And then you can do it for the remaining blocks. And then you can hash all of them together, all right? So there's typically more invocations. I don't really want to get into it. But there's typically more invocations of h when the input would be 2 times n, or 3 times n, all right?

So we don't really need to go there for the purposes of this lecture. But keep that in mind. So we'll still stick with our arbitrary string requirement.

So having said that, take a look at this picture. And see what this picture implies. I have an h prime that I've constructed, right? Now if I look at h prime, and I give you an output for h prime-- so h prime now has, it's a function of a and b, and x2 all the way to xn, right? So it's got an extra input.

If I look at h prime, and I look at the output of h prime that is given to me, and I need to discover something that produces that, it is pretty clear that I need to figure out what these values are, all right? And I need to know what the parity of a and b is. And maybe I don't need to know exactly what a and b are, but I absolutely need to know what the parity of a and b are, because that's x1. And the one-way I'd break would require me to tell you what the value of x1 is, and the value of x2, and so on and so forth.

So it's pretty clear that h prime is one-way, right? Everybody buy that? h prime is one-way.

But you know what? I've got target collisions galore, right? All I have to do is flip-- I have a equals 1 and b equals 1. And I have a equals 0 and b equals 0. They're going to give me the same hash, right?

So trivial example, but that gets to the essence of the difference between collision resistance and one-wayness, target collision resistance and one-wayness, all right? So this is one-way but not TCR, simply because a equals 0, b equals 0 for arbitrary x's produce the same thing as a equals 1 and b equals 1, right? So those are collisions.

So admittedly contrived. But it's a counterexample. Counterexamples can be contrived. It's OK.

All right. So that was what happens with that. Let's look at one more interesting thing that corresponds to the other way, right? So what I want to show is that a TCR does not imply one-wayness. OK, so now I want an example where it is clear that I have target collision resistance, because I can just assume that. And we're going to use the same strategy.

I'm just going assume that I have an h that's target collision resistant. And I'm going to try and cook up an h prime that is not one-way. So I'm going to assume that in fact h is TCR and OW.

And I'm going to take away one of the properties. And if I take it one of the properties I have a counterexample, right? So think about how you could do this.

You have h as before. And I want to add some stuff around it such that it's going to be easy to discover-- for a large, for a constant fraction of hashes that I've given to me, not for any old hash. Because you can always claim that one-wayness is broken by saying I have x, I computed h of x, now I know what-- given h of x I know what x is. I mean you can't do that, right? So that's not breaking the one-wayness of it.

It's when you have an h of x and this is the first time you've seen it, you're trying to find what x is, right? So how would you-- how would you set it up so you break the one-wayness of h without necessarily breaking the target collision resistance of the overall hash function that you're creating? And you have to do something with the outputs, OK? You have to do something.

This is a little more involved. It's not as easy as this example. It's a little more involved. But any ideas? Yeah, go ahead.

AUDIENCE: So x is less than b returns x. If x is greater than b, return [INAUDIBLE].

SRINIVAS DEVADAS: Beautiful. Right. What color did you get last time?

AUDIENCE: Blue.

SRINIVAS DEVADAS: You got a blue last time? All right. Well you get a purple. You have a set. Actually we have these red ones that are precious, that are-- no, we don't. We chose not to do red. I don't know. There was some subliminal message I think with throwing red Frisbees that we didn't like.

But OK. So thank you. And h of x is simply something where I'm going to concatenate a zero to the x value and just put it out. And clearly this is breaking one-wayness because I'm just taking the input, I'm adding a zero to it, and shipping it out. So it's going to be easy to go backwards, right?

And this only happens if x is less than n, as the gentleman just said. Less than or equal to n in terms of the input length, OK? Otherwise I'm going to do h of x. So this is good news. Because I'm actually using the hash function in the case where I have a longer input string. This is bad news for one-wayness because I'm just piping out the input.

And so if I get an x, and I see what the x is out here, and let's just say for argument's sake that-- you could even say that n is going to be something that is less than d, which is the final output, which has d-bits. And so if you see something that h prime produces that's less than d-bits you instantly know that you can go backwards and discover what input produced that for the h prime, right? Because you just go off and you go backwards. This is what it tells you.

Now on the other hand if it's larger obviously you can't do that. But there's a whole lot of combinations that you can do that for. So this breaks one-wayness, OK?

Now you think about TCR. And what you want a show of course is that this maintains TCR. So that's the last thing that we have to show. We know that it breaks one-wayness. But if it broke TCR we don't quite have our example. So we want to show that it actually maintains TCR, which is kind of a weakish property that we need to maintain.

And the reason this maintains TCR is that there's really only two cases here obviously, corresponding to the if statement. And it's pretty clear that if x is less than or equal to n, clearly different x's produce different h prime x's, correct? Because I'm just passing along the x out to the output.

So if x is less than n I am going to get different hashes at the output. I'm just passing them out. So that's easy.

And for the other case, well I assume that h of x was CCR, correct? Because that was the original assumption, that I had h, which was CCR. So in both cases TCR is maintained because else h of x maintains TCR, all right? So again, a bit of a contrived example to show you the difference between these different properties so you know not to mix them up. You know what you want to ask for, what is required when you actually implement an application that depends on particular properties.

All right? Any questions so far about properties or any of these examples? We're going to dive in to using them.

OK. So start thinking computer security. Start thinking hackers, protecting yourself against the bad guys that are out there who are trying to discover your passwords, trying to corrupt your files, generally make your life miserable. And we'll start out with fairly simple examples, where the properties are somewhat obvious, and graduate to this auction bidding example which should be sort of the culmination of at least this part of the lecture.

And depending on how much time I have I'll tell you a little bit about how to implement hash functions. But I think these things are more important from a standpoint of giving you a sense of cryptographic hashes. All right. Password storage.

How many of you write your password in an unencrypted text file and store it in a readable location? There you go, man. Thank you for being honest. And I do worse. Not only do I do that, I use my first daughter's name for four passwords. I won't tell you what the name is.

So that's something that we'd like to fix, right? So what do real systems do? Real systems cannot protect against me using my first daughter's name as a password, right? So there's no way you can protect against that.

But if I had a reasonable password, which had reasonable entropy in it-- so let's assume here that we have reasonable entropy in the password. And you can just say 128-bits. And it's not a lot, right? 128-bits is 16 characters, OK? And you don't have to answer this-- how many of you have 16 characters in your password?

Oh I'm impressed. OK. So you've got 128-bits of entropy. But the rest of you, forget it. This is not going to help you, OK? But what I want, assuming you have significant entropy in your password-- because otherwise, if there's not enough entropy you can just enumerate all possible passwords of eight letters. And it's not that much. It's 2 raised to 50, what have you. And you can just go off. And none of these properties matter.

You just-- you have your h of x. It's public. We'll talk about how we use that in a second. But clearly if the domain is small you can just enumerate the domain. So keep that in mind.

I talked about h of x, and it's obviously going to be relevant here. But suppose I wanted to build a system, and this is how systems are built, ETC slash password file, assuming you have long passwords it does it this way, otherwise it needs something that's called a salt. But that's 6, 8, 57 and we won't go there. So we just assume a large entropy.

What is it that a system can do? What can it store in order to let you in, and only let you in when you type your password, and not let some bogus password into the system? Or somebody with a bogus password into the system. Yeah, go ahead.

AUDIENCE: If you capture the password when you enter it and compare it to what's stored--

AUDIENCE: If it's a one-way hash you know you have what the correct password is.

SRINIVAS DEVADAS: That's exactly right. That's exactly right. So it's a really simple idea, a very powerful idea. It, as I said, assumed that the entropy-- and I'm belaboring the obvious now-- but it is important when you talk about security to state your assumptions. But you do not store password on your computer. And you store the hash of the password.

Now why do I store my password on the computer? Because this is so inconvenient, right? So this is what the system does for me. But the fact of the matter is, if I lose my password, this doesn't help me. Because what the system wants you to do is choose a password that is long enough, and the h is one-way. So anybody who discovers h of PW that is publicly readable cannot discover PW, all right? That's what's cool about this.

How do you let the person log in? Use h of PW to compare against h of PW prime, which is what is entered, where PW prime is the typed password.

And clearly what we need is the disclosure of h of PW should not reveal PW. So we definitely need one-wayness. What about-- what about collision resistance? Our target collision resistance?

Think practitioner now, right? Are we interested in this hash function being collision resistant? What does that mean in this case? Give me the context in this particular application? Yeah, go ahead.

AUDIENCE: It means that someone entering a different password will have the same hash [INAUDIBLE].

SRINIVAS DEVADAS: Exactly. So it means that what you have is a situation where you do not reveal-- and so what might happen is that h of PW prime equals h of PW. But h of PW equals h of PW prime. But PW is not equal to PW prime. What you have is a false positive.

Someone who didn't know your password but guessed right-- and this is a 128-bit value, and they guessed right-- is going to get it. You don't particularly care of the probability of this occurrence. It's really small. Typically you're going to have systems that lock you out if you try 10 tries that occurs one, two, wrong passwords, right?

So really in systems you do not require-- you do want to build systems that have minimal properties with respect to the perimeters that are used. So from a system building standpoint just require OW. Don't go overboard. Don't require collision resistance or TCR, OK?

Let's do a slightly different example. Also a bit of a warm-up for what's coming next, which is a file modification detector. So for each file F, I'm going to store h of F. And as securely. So you assume that this means that h of F cannot be modified by anybody, h of F itself.

And now we want to check if F is modified by re-computing h of F. Which could be, this could be modified. So this could actually be F prime. You don't know that.

You have a file. It's a gigabyte. And somebody might have tampered with one of the bits in the file. All you have is a d-bit digest that corresponds to h of F that you stored in a secure location. And you want to check to see, by re-computing h of F, the file that is given to you, and comparing it with what you've stored, the h of F that you've stored.

And so what property do we need in order to pull this off? Of hash functions. What precisely do we need to pull this off? What is the adversary trying to do? And what is a successful break?

A successful break is if an adversary can modify the file and keep h of F the same, right? That would be a successful break, right? Yup. Go ahead.

AUDIENCE: TCR.

SRINIVAS DEVADAS: TCR? Yeah, absolutely. You need TCR. So you want to modify the file. So you're given that the file-- the adversary is given the file, which is the input to the hash, and is going to try and modify-- modify the file, right?

So let's do a couple more. And we're going to advance our requirements here a little bit. So those two are basic properties. I want to leave this up there.

We're going to do something that corresponds to digital signatures. So digital signatures are this wonderful invention that came out of MIT in a computer science laboratory-- again, Ron Rivest and collaborators-- which are a way of digitally signing a document using a secret key, a private key. But anybody who has access to a public key, so it could be pretty much anybody, could verify the authenticity of that signature, right? So that's what a digital signature is.

So we're going to talk about public cryptography on Thursday, in terms of how you could build systems or encryption algorithms that are public key algorithms. But here I'll just tell you what we want out of them. Essentially what we have here in the case of signatures, we actually want to talk about encryption here, are-- there's two keys associated with a public key system.

Anybody and everybody in the system would have a public key that you can put on your website. And you also have a secret key-- that's like your password-- that you don't want to write down, you don't want to give away, because that's effectively your identity. And what digital signatures respond to are that you have two operations. You have signing and verification.

So signing means that you create a signature sigma that is the sign using your private key, your secret key, off a message M. So you're saying this is this message, it came from me, right? That's what signing means. You have this long message and you sign it at the bottom. You're taking responsibility for the contents of that message.

And then verification is you have M sigma and a public key. And this is simply going to output true or false. And so the public key should not reveal any information about the secret key. And that's the challenge of building PKI systems, that we'll talk about in some detail next time. But we don't need to think about that other than acknowledging it today.

So the public and private key are two distinct things, neither one of which reveals anything about the other. Think of them as completely distinct passwords. But they happen to be mathematically related. That's why this whole thing works. And that mathematical relationship we'll look at in some detail on Thursday.

But having said that, take a look at what this app is doing for us, right? This is a security application. And I haven't quite gotten to hash functions yet. But I'll get to it in just a minute.

But what I want to do is emphasize that there's two operations going on. One of which is a signature, which is a private signature, in the sense that it's private to me, if I'm Alice. Or private to Alice. And you're using secret information on this public message, M, because that's going to be publicized. And you're going to sign the public message.

And then anybody in the world who has access to Alice's public key is going to be able to say, oh I'm looking at the signature, which is a bunch of bits. I'm looking at the message, which is a whole lot of bits. And I have this public key, which is a bunch of bits. And I'm going to be able to tell for sure that either Alice signed this message, or Alice did not sign this message.

And the assumption here is that Alice kept her private key secret. And of course, what I just wrote there, that the public key does not reveal anything about the secret key, OK? So that's digital signatures for you, in a nutshell. And when you do MIT certificates you're using digital signatures a la Rivest-Shamir-Adleman, the RSA algorithm. So you're using this all the time, when you click on 6.046 links, for example.

And what happens is M is typically really large. I mean it could be a file, right? It could be a large file. And you don't necessarily want to compute these operations on large files. So for convenience, what happens is you end up hashing the file. And for large M it's easier to sign h of M. And so replace the M's that you see here with h of M, all right?

So now that we're given that we're going to be doing h of M in here, think about what we wanted to accomplish with M, right? I told you what we wanted to accomplish with M. There's a particular message. I'm Alice. I'm going to keep my secret key secret. But I want to commit to signing this message M, all right? And I want to make sure that nobody can pretend to be me who doesn't know my secret key. And nobody does.

So if I'm going to be signing the hash of the message, now it comes down to today's lecture. I'm signing the hash of the message h of M. What property do I require of h in order for this whole thing to work out? Yeah, go ahead.

AUDIENCE: Is it non-malleability?

SRINIVAS DEVADAS: Non-malleability, but even before that-- suppose-- absolutely, but non-malleability is kind of beyond one of these properties over on the right. You're on the right track, right? So do you want to give me a different answer? You can give me a different answer.

AUDIENCE: Oh, I'm not sure.

SRINIVAS DEVADAS: OK. What? Yeah, back there.

AUDIENCE: I think you wanted to one-way because otherwise you could take that signature and find another message that you could credit.

SRINIVAS DEVADAS: I can make M public. I can make M-- M can be public. And h of M is public. So one-wayness is not interesting for this example if M is public. And we can assume that M eventually gets public. Because that's the message I'm signing, right? I can also put M out.

So I want the relationship-- I want you to focus on the relationship between h of M and M and tell me what would break this system. And you're on the right track. Yeah, go ahead. Or way back there. Yeah, sorry about that.

AUDIENCE: TCR.

AUDIENCE: [INAUDIBLE].

SRINIVAS DEVADAS: So I have M. So what happens here-- I should write this out. I'm given-- as an adversary I have M and h of M. It is bad if Alice signs h of M, but Bob claims Alice signed M prime. Because h of M equals h of M prime, right? That is bad.

So the M is public-- could you stand up? M is given. There's a specific M, and a specific h of M in particular, that has been exposed. And h of M is what was used for the signature. So you want to keep h of M the same. It's a specific one. So it's not collision resistance, it's target collision resistance, because that's given to you.

And you want to keep that the same. But you want to claim that oh, you promised me \$10,000, not \$20, right? If you can do that, you signed saying you want to pay \$10,000, not \$20, then you've got a problem.

So your thing is very close. It's just that it doesn't need to be a strong relationship between the 10,000 or the 20. I mean I give you a concrete example of that. But it could be more, it could be less. Anything that is different from what you signed, be it with the numerical relationship or not, would cause a problem and break this scheme, all right? Are we good?

All right, one last example, the most interesting one. And as I guessed I'm probably not going to get to saying very much about how cache functions are implemented. But maybe I'll spend a minute or two on it. So let's do this example that has to do with commitments.

Commitment is important, right? You want to commit to doing things. You want to keep your promises. And in this case we have a legal requirement that you want to be able to make people honor their commitments, and not weasel their way out of commitments, right? And we want to deal with this computationally.

And let's think about auctions. So Alice has value x, e.g. an auction bid. Alice computes what we're going to call C of x, which is a commitment of x, and cements it, right? C of x, C of x is-- let's assume that the auctioneer, and perhaps other auctionees as well, see C of x. You have to submit it to somebody, right? So you can assume that that's exposed.

And what is going to happen is, when bidding is over Alice is going to open-- so this is-- C of x can be thought of as sealing the bid. So that's the commitment. You're sealing the-- you're making a bid and you're sealing it in an envelope. You've committed to that. That's obviously, what happens in real life without cryptography, but we want to do this with cryptography, with hash functions.

And so now Alice opens C of x to reveal x. So she has to prove that in fact x was her bid. And that it matches what she sealed. When you open it up, think about it conceptually from a standpoint of what happens with paper, and then we have to think about this computationally and what this implies, right?

So again I'll do a little bit of set up. And then we have start talking about the properties that we want for this particular application. So there are a bunch of people who are doing bidding for this auction. I don't-- I want to be the first-- I don't want to spend a lot of money. But I want to win. All of us are like that, right?

If I know information about your bid, that is obviously a tremendous advantage. So clearly that can't happen, right? If I know one other person's bid I just do plus 1 on that. If I know everybody else's I just do plus 1 on the maximum. So clearly there's some secrecy that's required here, correct?

So C of x is going to have to do two things. It can't reveal x. Because then even maybe the auctioneer is bad. Or other people are looking at this. And you can just assume that C of x is-- the C of x's are all public.

But I also need a constraint that's associated with C of x that corresponds to making sure Alice is honest, correct? So I need to make Alice commit to something, right? So what are the different properties of the hash function that if I use h of x here, that I'd want h to satisfy in order for this whole process to work like it's supposed to work with paper and envelopes? Yeah, go ahead.

AUDIENCE: It has to be one-way [INAUDIBLE].

SRINIVAS DEVADAS: It has to be one-way. And explain to me-- so I want a description of it has to be one-way, because why?

AUDIENCE: Because you want all the c x's to be hidden from all the other options.

SRINIVAS DEVADAS: Right. C of x should not reveal x, all right? All right. That's good. Do you have more? It has to be collision resistant. OK. I guess.

A little bit more. You're getting there. What-- why is it collision resistant?

AUDIENCE: Because you want to make sure that Alice, when she makes a bid that she commits that bid. If she's not going to resist it then she could bid \$100 and then find something else.

SRINIVAS DEVADAS: That's exactly right. So CR, because Alice should not be able to open this in multiple ways, right? And in this case it's not TCR in the sense that Alice controls what her bids are. And so she might find a pair of bids that collide, correct? She might realize that in this particular hash function, you know \$10,000 and a billion dollars collide, right?

And so she figures depending on what happens, she's a billionaire, let's assume. She's going to open the right thing. She's a billionaire, but she doesn't necessarily want to spend the billion, OK? So that's that, right? But I want more. Go ahead.

AUDIENCE: You don't want it to be malleable. Assuming that the auctioneer is not honest because you don't want to accept a bribe from someone and then change everyone else's bid to square root of whatever they bid.

SRINIVAS DEVADAS: That's exactly right. Or plus 1, which is a great example, right? So there you go. I ran out of Frisbees. You can get one next time.

So yeah, I don't need this anymore. You're exactly right. There's another-- it turns out it's even more subtle than what you just described. And I think I might be able to point that out to you. But let me just first describe this answer, which gives us non-malleability.

So the claim is that you also want non-malleability in your hash function. And the simple reason is, given C of x-- and let's assume that this is public. It's certainly public to the auctioneer, and it could be public to the other bidders as well.

Because the notion of sealing is that you've sealed it using C of x. But people can see the outside of the envelope, which is C of x. So everyone can see C of x. You still want this to work, even though all other bidders can see C of x.

So given C of x, should not be possible to produce C of x plus 1. You don't know x is. But if you can produce C of x plus 1, you win, all right? And so that's the problem.

Now it turns out you now say OK, am I done? I want these three properties. And I'm done, right? There's a little subtlety here which these properties don't capture. So that's why there's more here. And I don't mean to titillate, because I'll tell you what is missing here.

But let's say that I have a hash function that looks like this. And this here is non-malleable. It is collision resistant. And it's one-way, all right? So h of x has all these wonderful properties, all right?

I'm creating an h prime x that looks like this, which is a concatenation of h of x, and giving away the most significant bit of x, which is my bid, right? I'm just giving that away, right? The problem here is that we haven't really made our properties broad enough to solve this particular application to the extent that there's contrived cases where these properties aren't enough, OK?

And the reason is simple. h prime x is arguably NM, CR, and OW. And I won't go into to each of those arguments. But you can think about it, right?

If I'm just giving you one bit, there's 159 others, there's a couple of hundred others, whatever it is that I have in the domain. It's not going to be invertible. h prime x is not going to be invertible if h of x is not invertible. h prime x is not going to be breakable in terms of collision resistance if h of x is not breakable, and so on and so forth.

But if I had a hash function like that, is it a good hash function for my commitment application? No, obviously not. Because if I publicize this hash function-- remember, everything is public here with respect to h and h prime-- you are giving away the most significant that corresponds to your bid in this particular hash function, right?

So you really need a little bit more than these for secrecy, for true secrecy. But in the context of this example, I mean it's common sense that you would not use the hash function like that, right? So it's not that there's anything profound here. It's just that I want to make sure that you understand the nuances of the properties that we're requiring.

We had all the requirements corresponding to the definitions of NM and CR and OW. And you need a little bit more for this example, where you have to say something, perhaps informally, like the bits of your auction are scrambled in the final hash output, which most hash functions should do anyway, and h of x will definitely do. But you kind of unscrambled it by adding this little thing in here, corresponding to the most significant thing, all right?

So I'll stop with that. Let me just say that the operation-- or sorry, the work involved in creating hash functions that are poly-time computable is research work. People put up hash functions and they get broken, like MD4 was put up in '92 and then got broken, SHA-1 and so on and so forth. And so I just encourage you to look up SHA-3 and just take a quick scan and what the complexity of SHA-3 is with respect to computing the hash given an arbitrary string, all right?

I'll stick around for questions.