Lecture 24: Large Deviations

Description: Covers large deviations. Presents three tools (Markov's theorem, Chebyshev's theorem, and the Chernoff bound) for bounding the probability that a random variable deviates from its expectation, a problem that comes up frequently in computer science.

Speaker: Tom Leighton

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So today, we're going to talk about the probability that a random variable deviates by a certain amount from its expectation. Now, we've seen examples where a random variable is very unlikely to deviate much from its expectation. For example, if you flip 100 mutually independent fair coins, you're very likely to wind up with close to 50 heads, very unlikely to wind up with 25 or fewer heads, for example.

We've also seen examples of distributions where you are very likely to be far from your expectation, for example, that problem when we had the communications channel, and we were measuring the latency of a packet crossing the channel. There, most of the time, your latency would be 10 milliseconds. But the expected latency was infinite. So you're very likely to deviate a lot from your expectation in that case.

Last time, we looked at the variance. And we saw how that gave us some feel for the likelihood of being far from the expectation-- high variance meaning you're more likely to deviate from the expectation. Today, we're going to develop specific tools for bounding or limiting the probability you deviate by a specified amount from the expectation. And the first tool is known as Markov's theorem.

Markov's theorem says that if the random variable is always non-negative, then it is unlikely to greatly exceed its expectation. In particular, if R is a non-negative random variable, then for all x bigger than 0, the probability that R is at least x is at most the expected value of R, the mean, divided by x.

So in other words, if R is never negative and, for example, the expected value is small, then the probability that R is large will be a small number, because I'll have a small number over a big number.

So it says that you are unlikely to greatly exceed the expected value. So let's prove that. Now, from the theorem of total expectation that you did in recitation last week, we can compute the expected value of R by looking at two cases-- the case when R is at least x, and the case when R is less than x. That's from the theorem of total expectation.

I look at two cases. R is at least x: take the expected value there times the probability of this case happening. Plus the same thing for the case when R is less than x. OK, now since R is non-negative, this second term is at least 0. R can't ever be negative. So the expectation can't be negative. A probability can't be negative. So this is at least 0.

And this is trivially at least x. Because I'm taking the expected value of R in the case when R is at least x. So R is always at least x in this case. So its expected value is at least x. So that means that the expected value of R is at least x times the probability R is greater than x, R is greater or equal to x.

And now I can get the theorem by just dividing by x. The probability that R is at least x is less than or equal to the expected value of R divided by x. So it's a very easy theorem to prove. But it's going to have amazing consequences that we're going to build up through a series of results today. Any questions about Markov's theorem and the proof?
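
The proof just given can be written out in symbols as follows. This is a sketch of the board work as described; the notation (Pr for probability, E for expectation) is an assumption here, not taken from the lecture.

```latex
\begin{align*}
\mathrm{E}[R]
  &= \mathrm{E}[R \mid R \ge x]\,\Pr[R \ge x]
   + \mathrm{E}[R \mid R < x]\,\Pr[R < x] \\
  &\ge x \cdot \Pr[R \ge x] + 0
   && \text{(first term: } R \ge x\text{; second term: } R \ge 0\text{)} \\
\text{so}\quad \Pr[R \ge x] &\le \frac{\mathrm{E}[R]}{x}
   && \text{(dividing by } x > 0\text{)}.
\end{align*}
```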

All right, there's a simple corollary, which is useful. Again, if R is a non-negative random variable, then for all c bigger than 0, the probability that R is at least c times its expected value is at most 1 over c. So the probability you're at least twice your expected value is at most 1/2.

And the proof is very easy. We just set x to be equal to c times the expected value of R in the theorem. So I just plug in x is c times the expected value of R. And I get expected value of R over c times the expected value of R, which is 1/c. So you just plug in that value in Markov's theorem, and it comes out.

All right, let's do some examples. Let's let R be the weight of a random person uniformly selected. And I don't know what the distribution of weights is in the country. But suppose that the expected value of R, which is the average weight, is 100 pounds. So if I average over all people, their weight is 100 pounds.

And suppose I want to know the probability that the random person weighs at least 200 pounds. What can I say about that probability? Do I know it exactly? I don't think so. Because I don't know what the distribution of weights is.

But I can still get an upper bound on this probability. What bound can I get on the probability that a random person has a weight of 200 given the facts here? Yeah.


PROFESSOR: Yes, well, it's 100 over 200, right. It's at most the expected value, which is 100, over the x, which is 200. And that's equal to 1/2. So the probability that a random person weighs 200 pounds or more is at most 1/2.

Or I could plug it in here. The expected value is 100. 200 is twice that. So c would be 2 here. So the probability of being twice the expectation is at most 1/2.
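
As a rough check of the weight example, here is a minimal Python sketch of Markov's bound. The function name `markov_bound` and the cap at 1 are assumptions made for illustration; the lecture only states the bound as the expected value over x.

```python
from fractions import Fraction


def markov_bound(expected_value, x):
    """Upper bound on Pr[R >= x] for a non-negative random variable R.

    Hypothetical helper: returns E[R] / x, capped at 1 because no
    probability can exceed 1.
    """
    if expected_value < 0:
        raise ValueError("a non-negative R cannot have a negative mean")
    if x <= 0:
        raise ValueError("Markov's theorem needs x > 0")
    return min(Fraction(1), Fraction(expected_value) / Fraction(x))


# Weight example from the lecture: average 100 pounds, threshold 200 pounds.
weight_bound = markov_bound(100, 200)
print(weight_bound)
```

The same call covers the corollary: a threshold of c times the mean gives 1/c, whatever the mean is.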

Now of course, I'm using the fact that weight is never negative. That's obviously true. But it is implicitly being used here. So what fraction of the population now can weigh at least 200 pounds? Slightly different question. Before I asked you, if I take a random person, what's the probability they weigh at least 200 pounds? Now I'm asking, what fraction of the population can weigh at least 200 pounds if the average is 100? What is it? Yeah?

AUDIENCE: At most 1/2.

PROFESSOR: At most 1/2. In fact, it's the same answer. And why? Why can't everybody weigh 200 pounds, so it would be all the population weighs 200 pounds at least?


PROFESSOR: Probability would be 1, and that can't happen. And in fact, intuitively, if everybody weighs at least 200 pounds, the average is going to be at least 200 pounds. And we said the average was 100. And this is illustrating this interesting thing that probability implies things about averages and fractions. Because it's really the same thing in disguise. The connection is, if I've got a bunch of people, say, in the country, I can convert a fraction that have some property into a probability by just selecting a random person. Yeah.


PROFESSOR: No, the variance could be very big. Because I might have a person that weighs a million pounds, say. So you have to get into that. But it gets a little bit more complicated. Yeah.


PROFESSOR: No, there's nothing being assumed about the distribution, nothing at all, OK? So that's the beauty of Markov's theorem. Well, I've assumed one thing. I assume that there is no negative values. That's it.


PROFESSOR: That's correct. They can distribute it any way with positive values. But we have a fact here we've used, that the average was 100. So that does limit your distribution. In other words, you couldn't have a distribution where everybody weighs 200 pounds. Because then the average would be 200, not 100. But anything else where they're all positive and they average 100, you know that at most half can be 200. Because if you pick a random one, the probability of getting one that's 200 is at most 1/2, which follows from Markov's theorem.

And that's partly why it's so powerful. You didn't know anything about the distribution, really, except its expectation and that it was non-negative. Any other questions about this? I'll give you some more examples. All right, here's another example. Is it possible on the final exam for everybody in the class to do better than the mean score? No, of course not. Because if they did, the mean would be higher. Because the mean is the average.

OK, let's do another example. Remember the Chinese appetizer problem? You're at the restaurant, big circular table. There's n people at the table. Everybody has one appetizer in front of them. And then the joker spins the thing in the middle of the table. So it goes around and around. And it stops in a random uniform position.

And we wanted to know, what's the expected number of people to get the right appetizer back? What was the answer? Does anybody remember? One. So you expect one person to get the right appetizer back.

Well, say I want to know the probability that all n people got the right appetizer back. What does Markov tell you about the probability that all n people get the right appetizer back? 1/n. The expected value is 1. And now you're asking the probability that you get R is at least n. So x is n. So it's 1 in n.

And what was the probability, or what is the actual probability? In this case, you know the distribution, that everybody gets the right appetizer back, all n. 1 in n. So in the case of the Chinese appetizer problem, Markov's bound is actually the right answer, right on target, which gives you an example where you can't improve it.

By itself, if you just know the expected value, there's no stronger theorem that way. Because Chinese appetizer is an example where the bound you get, 1/n, of n people getting the right appetizer is in fact the true probability.

OK, what about the hat check problem? Remember that? So there's n men put the hats in the coat closet. They get uniformly randomly scrambled. So it's a random permutation applied to the hats. Now each man gets a hat back. What's the expected number of men to get the right hat back?

One, same as the other one. Because you've got n men each with a 1 in n chance, so it's 1. Markov says the probability that n men get the right hat back is at most 1 in n, same as before. What's the actual probability that all n men get the right hat back?


PROFESSOR: 1 in n factorial. So in this case, Markov is way off the mark. It says 1 in n. But in fact the real probability is much smaller. So Markov is not always tight. It's always an upper bound. But it sometimes is not the right answer. And to get the right answer, often you need to know more about the distribution.
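
The contrast between the two problems can be checked by brute force for a small table. The sketch below is illustrative only: the function names are hypothetical, and it enumerates every rotation and every permutation, so it is only practical for small n.

```python
from fractions import Fraction
from itertools import permutations


def prob_all_match_rotation(n):
    """Appetizer problem: the tray stops at one of n equally likely rotations."""
    rotations = [tuple((i + shift) % n for i in range(n)) for shift in range(n)]
    good = sum(1 for r in rotations if all(r[i] == i for i in range(n)))
    return Fraction(good, len(rotations))


def prob_all_match_permutation(n):
    """Hat check problem: all n! permutations of the hats are equally likely."""
    perms = list(permutations(range(n)))
    good = sum(1 for p in perms if all(p[i] == i for i in range(n)))
    return Fraction(good, len(perms))


def markov_bound_all_match(n):
    """Markov's bound on Pr[matches >= n] when the expected number of matches is 1."""
    return Fraction(1, n)


n = 5
print(markov_bound_all_match(n), prob_all_match_rotation(n), prob_all_match_permutation(n))
```

For n = 5 the bound is 1/5, which the rotation case meets exactly, while the permutation case comes out at 1/120.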

OK, what if R can be negative? Is it possible that Markov's theorem holds there? Because I use the assumption in the theorem. Can anybody give me an example where it doesn't work if R can be negative?


PROFESSOR: Yeah, good, so for example, say probability R equals 1,000 is 1/2, and the probability R equals minus 1,000 is 1/2. Then the expected value of R is 0. And say we asked the probability that R is at least 1,000. Well, that's going to be 1/2. But that is not at most the expected value of R over 1,000, which would be 0.
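
This counterexample is small enough to verify directly. A minimal sketch, with the distribution stored as a dictionary from value to probability (a representation chosen here for illustration):

```python
from fractions import Fraction

# R is 1,000 or -1,000, each with probability 1/2 (so R can be negative).
dist = {1000: Fraction(1, 2), -1000: Fraction(1, 2)}


def expectation(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


def tail_probability(dist, x):
    """Pr[R >= x] for a finite distribution given as {value: probability}."""
    return sum(prob for value, prob in dist.items() if value >= x)


mean = expectation(dist)
actual = tail_probability(dist, 1000)
would_be_bound = mean / 1000  # what Markov's formula would give if it applied

print(mean, actual, would_be_bound)
```

The actual probability exceeds the would-be bound, so the non-negativity assumption cannot simply be dropped.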

So Markov's theorem really does need that R to be non-negative. In fact, let's see if we saw where we used it in the proof. Anybody see where we use that fact in the proof, that R can't be negative? What is it?


PROFESSOR: Well, no, because x is positive. We said x is positive. So it's not used there. But that's a good one to look at. Yeah?

AUDIENCE: [INAUDIBLE] is greater than or equal to 0.

PROFESSOR: Yeah, if R can be negative, then this is not necessarily a positive number. It could be a negative number. And then this inequality doesn't hold. OK, good. All right, now it turns out there is a variation of Markov's theorem you can use when R is negative. Yeah.

AUDIENCE: [INAUDIBLE] but would it be OK just to shift everything up?

PROFESSOR: Yeah, yeah, that's great. If R has a limit on how negative it can be, then you make an R prime, which just adds that limit to R, makes it positive or non-negative. And now use Markov's theorem there. And that is now an analogous form of Markov's theorem when R can be negative, but there's a lower limit to it. And I won't state or prove that here. But that's in the text and something you want to be familiar with.

What I do want to do in class is another case where you can use Markov's theorem to analyze the probability or upper bound the probability that R is very small, less than its expectation. And it's the same idea as you just suggested. So let's state that.

If R is upper bounded, has a hard limit on the upper bound, by u for some u in the real numbers, then for all x less than u, the probability that R is less than or equal to x is at most u minus the expected value of R over u minus x.

So in this case, we're getting a probability that R is less than something instead of R is bigger than something. And we're going to do it using a simple trick that we'll be sort of using all day, really. The probability that R is less than x, this event, R is less than x, is the same as the event u minus R is at least u minus x.

So what have I done? I put negative R over here, subtract x from each side, add u to each side. I've got to put a less than or equal to here. So R is less than or equal to x if and only if u minus R is at least u minus x. It's simple math there.

And now I'm going to apply Markov to this. I'm going to apply Markov to this random variable. And this will be the value I would have had for x up in Markov's theorem. Why is it OK to apply Markov to u minus R?

AUDIENCE: You could just define the new random variable to be u minus R.

PROFESSOR: Yeah, so I got a new random variable. But what do I need to know about that new random variable to apply Markov?

AUDIENCE: u is always greater than R.

PROFESSOR: u is always greater than R, or at least as big as R. So u minus R is always non-negative. So I can apply Markov now. And when I apply Markov, I'll get this is at most-- maybe I'll go over here. The probability that-- ooh, not R here. This is probability.

The probability that u minus R is at least u minus x is at most the expected value of that random variable over this value. And well now I just use the linearity of expectation. I've got a scalar here. So this is u minus the expected value of R over u minus x. So I've used Markov's theorem to get a different version of it.
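
Written as one chain, the argument above looks like this. It is a sketch in assumed notation (Pr and E), with u the hard upper limit on R and x < u:

```latex
\[
\Pr[R \le x]
  \;=\; \Pr[\,u - R \ge u - x\,]
  \;\le\; \frac{\mathrm{E}[u - R]}{u - x}
  \;=\; \frac{u - \mathrm{E}[R]}{u - x}.
\]
```

The inequality is Markov's theorem applied to the non-negative variable u - R with threshold u - x > 0, and the last step is linearity of expectation.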

All right, let's do an example. Say I'm looking at test scores. And I'll let R be the score of a random student uniformly selected. And say that the max score is 100. So that's u. All scores are at most 100.

And say that I tell you the class average, or the expected value of R, is 75. And now I want to know, what's the probability that a random student scores 50 or below? Can we figure that out? I don't know anything about the distribution, just that the max score is 100 and the average score is 75. What's the probability that a random student scores 50 or less? I want to upper bound that.

So we just plug it into the formula. u is 100. The expected value is 75. u is 100. And x is 50. And that's 25 over 50, is 1/2. So at most half the class can score 50 or below. And I can state it as a probability question or as a deterministic fact, if I know the average is 75 and the max is 100. Of course, another way of thinking about that is if more than half the class scored 50 or below, your average would have had to be lower, even if everybody else was right at 100. It wouldn't average out to 75. All right, any questions about that?

OK, so sometimes Markov is dead on right, gives the right answer. For example, half the class could have scored 50, and half could have gotten 100 to make it be 75. And sometimes it's way off, like in the hat check problem.
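
The test score example, and the half-and-half class that meets the bound exactly, can be checked with a short sketch. The helper name `lower_tail_bound` is hypothetical.

```python
from fractions import Fraction


def lower_tail_bound(upper_limit, expected_value, x):
    """Bound on Pr[R <= x] when R never exceeds upper_limit and x < upper_limit."""
    if not x < upper_limit:
        raise ValueError("x must be strictly below the upper limit")
    if expected_value > upper_limit:
        raise ValueError("the mean cannot exceed the upper limit")
    return Fraction(upper_limit - expected_value, upper_limit - x)


# Lecture example: max score 100, class average 75, threshold 50.
bound = lower_tail_bound(100, 75, 50)

# A class that meets the bound exactly: half score 50, half score 100.
scores = {50: Fraction(1, 2), 100: Fraction(1, 2)}
class_mean = sum(score * prob for score, prob in scores.items())
prob_50_or_below = sum(prob for score, prob in scores.items() if score <= 50)

print(bound, class_mean, prob_50_or_below)
```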

Now, if you know more about the distribution, then you can get better bounds, especially the cases when you're far off. For example, if you know the variance in addition to the expectation, or aside from the expectation, then you can get better bounds on the probability that the random variable is large.

And in this case, the result is known as Chebyshev's theorem. I'll do that over here. And it's the analog of Markov's theorem based on variance. It says, for all x bigger than 0, and any random variable R-- could even be negative-- the probability that R deviates from its expected value in either direction by at least x is at most the variance of R divided by x squared.

So this is like Markov's theorem, except that we're now bounding the deviation in either direction. Instead of expected value, you have variance. Instead of x, you've got x squared, but the same idea. In fact, the proof uses Markov's theorem.

Well, the probability that R deviates from its expected value by at least x, this is the same event, or happens if and only if R minus expected value squared is at least x squared. I'm just going to square both sides here. OK, I square both sides. And since this is positive and this is positive, I can square both sides and maintain the inequality.

Now I'm going to apply Markov's theorem to that random variable. It's a random variable. It's R minus expected value squared. So it's a random variable. And what's nice about this random variable that lets me apply Markov's theorem? It's a square. So it's always non-negative.

So I can apply Markov's theorem. And by Markov's theorem, this probability is at most the expected value of that divided by this. That's what Markov's theorem says as long as this is always non-negative. All right, what's a simpler expression for this, the expected value of the square of the deviation of a random variable? That's the variance. That's the definition of variance.

So that is just the variance of R over x squared. And we're done. So Chebyshev's theorem is really just another version of Markov's theorem. But now it's based on the variance. OK, any questions?
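
The three steps of this proof fit on one line. A sketch in assumed notation (Pr, E, and Var are not taken from the board):

```latex
\[
\Pr\bigl[\,|R - \mathrm{E}[R]| \ge x\,\bigr]
  \;=\; \Pr\bigl[\,(R - \mathrm{E}[R])^2 \ge x^2\,\bigr]
  \;\le\; \frac{\mathrm{E}\bigl[(R - \mathrm{E}[R])^2\bigr]}{x^2}
  \;=\; \frac{\mathrm{Var}[R]}{x^2}.
\]
```

The first equality squares both non-negative sides, the inequality is Markov's theorem applied to the squared deviation, and the last equality is the definition of variance.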

OK, so there's a nice corollary for this, just as with Markov's theorem. It says the probability that the absolute value, the deviation, is at least c times the standard deviation of R. So I'm looking at the probability that R differs from its expectation by at least some scalar c times the standard deviation.

Well, what's that? Well, that's the variance of R over the square of this thing-- c squared times the standard deviation squared. What's the square of the standard deviation? That's the variance. They cancel, so it's just 1 over c squared. So the probability of more than twice the standard deviation off the expectation is at most 1/4, for example. All right, let's do some examples of that. Maybe we'll leave Markov up there.
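
In symbols, the corollary is Chebyshev's theorem with x set to c standard deviations. Here \sigma_R is assumed notation for the standard deviation of R, and c is any positive number:

```latex
\[
\Pr\bigl[\,|R - \mathrm{E}[R]| \ge c\,\sigma_R\,\bigr]
  \;\le\; \frac{\mathrm{Var}[R]}{(c\,\sigma_R)^2}
  \;=\; \frac{\mathrm{Var}[R]}{c^2\,\mathrm{Var}[R]}
  \;=\; \frac{1}{c^2},
\qquad \sigma_R = \sqrt{\mathrm{Var}[R]}.
\]
```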

OK, say we're looking at IQs. In this case, we're going to let R be the IQ of a random person. All right, now we're going to assume-- and this actually is the case-- that R is always at least 0, despite the fact that probably most of you have somebody you know who you think has a negative IQ. They can't be negative. They have to be non-negative.

In fact, IQs are adjusted. So the expected IQ is supposed to be 100, although actually the averages may be in the 90's. And it's set up so that the standard deviation of IQ is supposed to be 15. So we're just going to assume those are facts on IQ. And that's what it's meant to be.

And now we want to know, what's the probability a random person has an IQ of at least 250? Now Marilyn, from "Ask Marilyn," has an IQ pretty close to 250. And she thinks that's pretty special, pretty rare. So what can we say about that? In particular, say we used Markov. What could you say about the probability of having an IQ of at least 250? What does Markov tell us?


PROFESSOR: What is it?


PROFESSOR: Not quite 1 in 25, but you're on the right track. It's not quite 2/3. It's the expected value, which is 100, over the x value, which is 250. So it's 1 in 2.5, or 0.4. So the probability is at most 0.4, so 40% chance it could happen, potentially, but no bigger than that. What about Chebyshev? See if you can figure out what Chebyshev says about the probability of having an IQ of at least 250.

It's a little tricky. You've got to sort of plug it into the equation there and get it to fit in the right form. Chebyshev says that-- let's get in the right form. I've got probability that R is at least 250. I've got to get it into that form up there. So that's the probability that-- well, first R minus 100 is at least 150. So I've got R minus the expected value. I'm sort of getting it ready to apply Chebyshev here.

And then 150-- how many standard deviations is 150? 10, all right? So this is the probability that R minus the expected value of R is at least 10 standard deviations. That's what I'm asking. I'm not quite there. I'm going to use the corollary there. But I've got to get that absolute value thing in.

But it's upper bounded by the probability of the absolute value of R minus expected value bigger than or equal to 10 standard deviations. Because this allows for two cases. R is 10 standard deviations high, and R is 10 standard deviations low or more. So this is upper bounded by that.

And now I can plug in Chebyshev in the corollary form. And what's the answer when I do that? 1 in 100-- the probability of being off by 10 standard deviations or more is at most 1 in 100, 1 in 10 squared. So it's a lot better bound. It's 1% instead of 40%. So knowing the variance or the standard deviation gives you a lot more information and generally gives you much better bounds on the probability of deviating from the mean.
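
The two IQ bounds can be reproduced with a few lines of Python. This is a sketch under the lecture's assumptions (mean 100, standard deviation 15, IQ never negative); the function names are hypothetical.

```python
from fractions import Fraction


def markov_bound(expected_value, x):
    """Markov: Pr[R >= x] <= E[R] / x for non-negative R and x > 0."""
    return Fraction(expected_value) / Fraction(x)


def chebyshev_bound(variance, x):
    """Chebyshev: Pr[|R - E[R]| >= x] <= Var[R] / x^2 for x > 0."""
    return Fraction(variance) / Fraction(x) ** 2


mean_iq, std_dev_iq, threshold = 100, 15, 250

markov = markov_bound(mean_iq, threshold)
# Pr[R >= 250] <= Pr[|R - 100| >= 150], so the two-sided bound also covers the high side.
chebyshev = chebyshev_bound(std_dev_iq ** 2, threshold - mean_iq)

print(markov, chebyshev)
```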

And the reason it gives you better bounds is because the variance is squaring deviations. So they count a lot more. All right, now let's look at this step a little bit more. All right, let's say here is a line, and here's the expected value of R. And say here's 10 standard deviations high here. So this will be more than 10 standard deviations. And this will be 10 standard deviations on the low side. So here, I'm low.

Now, this line here with the absolute value is figuring out the probability of being low or high. This is the probability that the absolute value of R minus its expected value is at least 10 standard deviations. What we really wanted to know for bound was just the high side.

Now, is it true that then, since the probability of high or low is 1 in 100, the probability of being high is at most 1 in 200, half? Is that true? Yeah?


PROFESSOR: Yeah, it is not necessarily true that the high and the low are equal, and therefore the high is half the total. It might be true, but not necessarily true. And that's a mistake that often gets made where you'll take this fact, that the two-sided probability is at most 1 in 100, to conclude that the high side is at most 1 in 200. And that you can't do, unless the distribution is symmetric around the expected value. Then you could do it, if it's a symmetric distribution around the expected value. But usually it's not.

Now, there is something better you can say. So let me tell you what it is. But we won't prove it. I think we might prove it in the text. I'm not sure. If you just want the high side or just want the low side, you can do slightly better than 1 in c squared. That's the following theorem.

For any random variable R, the probability that R is on the high side by c standard deviations is at most 1 over the quantity c squared plus 1. So it's not 1 over 2c squared. It's 1 over the quantity c squared plus 1, and the same thing for the probability of being on the low side.

Let's see, have I written this right? Hmm, I want to get this as less than or equal to negative c times the standard deviation. So here I'm high by c or more standard deviations. Here I'm low. So R is less than the expected value by at least c standard deviations. And that is also 1 over c squared plus 1.

And it is possible to find distributions that hit these targets-- not both at the same time, but one or the other, hit those targets. So that's the best you can say in general. All right, so using this bound, what's the probability that a random person has an IQ of at least 250? It's a little better than 1 in 100.


PROFESSOR: Yeah, 1/101. So in fact, the best we can say without knowing any more information about IQs is that it's at most 1/101, slightly better. Now in fact, with IQs, they know more about the distribution. And the probability is a lot less. Because you know more about the distribution than we've assumed here. In fact, I don't think anybody has an IQ over 250 as far as I know. Any questions about this?
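
The one-sided bound is easy to tabulate next to the two-sided one. A minimal sketch with hypothetical function names; the one-sided result is often called Cantelli's inequality, a name the lecture does not use.

```python
from fractions import Fraction


def two_sided_bound(c):
    """Chebyshev corollary: Pr[|R - E[R]| >= c std devs] <= 1 / c^2."""
    return Fraction(1, c * c)


def one_sided_bound(c):
    """One-sided version: Pr[R - E[R] >= c std devs] <= 1 / (c^2 + 1)."""
    return Fraction(1, c * c + 1)


# IQ example: 250 is 10 standard deviations above the mean of 100.
print(two_sided_bound(10), one_sided_bound(10))
```

Note that the one-sided bound is weaker than half of the two-sided one: 1/101 is larger than 1/200.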

OK, all right, say we give the exam. What fraction of the class can score more than two standard deviations, get two standard deviations or more, away from the average, above or below? Could half the class be two standard deviations off the mean? No? What's the biggest fraction that that could happen?

What do I do? What fraction of the class can be two standard deviations or more from the mean? What is it?


PROFESSOR: 1/4, because c is 2. You don't even know what the mean is. You don't know what the standard deviation is. You don't need to. I just asked, you're two standard deviations off or more. At most, 1/4. How many could be two standard deviations high or better at most? 1/5-- 1 over 4 plus 1, good. OK, this holds true no matter what the distribution of test scores is. Yeah?


PROFESSOR: Which one? This one?


PROFESSOR: Oh, that's more complicated. That'll take us several boards to do, to prove that. And I forget if we put it in the text or not. It might be in the text, to prove that. Any other questions?

OK so Markov and Chebyshev are sometimes close, sometimes not. Now, for the rest of today, we're going to talk about a much more powerful technique. But it only works in a special case. Now, the good news is this special case happens all the time in practice. And it's the case when you're analyzing a random variable that itself is the sum of a bunch of other random variables. And we've seen already examples like that.

And the other random variables have to be mutually independent. And in this case, you get a bound that's called a Chernoff bound. And this is the same Chernoff who figured out how to beat the lottery. And it's interesting. Originally, long before we started teaching this here, Chernoff bounds were only taught to graduate students. And now we teach it here. Because it's so important. And it really is accessible.

It'll be probably the most complicated proof we've done to establish a Chernoff bound. But Chernoff himself, when he discovered this, thought it was no big deal. In fact, he couldn't figure out why everybody in computer science was always writing papers with Chernoff bounds in them.

And that's because he didn't put any emphasis on the bounds in his work. But computer scientists who came later found all sorts of important applications. And we'll see some of those today. So let me tell you what the bound is. And the nice thing is it really is Markov's theorem again in disguise, just a little more complicated.

Theorem-- it's called a Chernoff bound. Let T1, T2, up to Tn be any mutually independent-- that's really important-- random variables such that each of them takes values only between 0 and 1. And if they don't, just normalize them so they do. So we're going to take a bunch of random variables that are mutually independent. And they are all between 0 and 1.

Then we're going to look at the sum of those random variables, call that T. Then for any c at least 1, the probability that the sum random variable is at least c times its expected value. So it's going to be the high side here-- is at most e to the minus z, and I'll tell you what that is in a minute, times the expected value of T where z is c natural log of c plus 1 minus c. And it turns out if c is bigger than 1, this is positive.
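
Stated in symbols, the theorem reads as follows. This is a transcription of the spoken statement into assumed notation, not a copy of the board:

```latex
\[
T = T_1 + T_2 + \cdots + T_n,
\qquad T_i \text{ mutually independent},
\qquad 0 \le T_i \le 1,
\]
\[
\Pr\bigl[\,T \ge c\,\mathrm{E}[T]\,\bigr] \;\le\; e^{-z\,\mathrm{E}[T]}
\quad\text{for all } c \ge 1,
\qquad\text{where } z = c \ln c + 1 - c .
\]
```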

So that's a lot, one of the longest theorems we wrote down here. But what it says is that the probability we're high is exponentially small. As the expected value is big, the chance of being high gets really, really tiny. Now, I'm going to prove it in a minute. But let's just plug in some examples to see what's going on here.

So for example, suppose the expected value of T is 100. And suppose c is 2. So we expect to have 100 come out of the sum. The probability we get at least 200-- well, let's figure out what that is. c being 2 we can evaluate z now. It's 2 natural log of 2 plus 1 minus 2. And that's close to but a little larger than 0.38.

So we can plug z in, the exponent up there, and find that the probability that T is at least twice its expected value, namely at least 200, is at most e to the minus 0.38 times 100, which is e to the minus 38, which is just really small.
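
The arithmetic in this example can be reproduced numerically. A minimal sketch, with hypothetical function names, that evaluates z and the resulting bound in floating point:

```python
import math


def chernoff_exponent(c):
    """z = c ln(c) + 1 - c, the exponent factor in the bound, for c >= 1."""
    if c < 1:
        raise ValueError("the bound is stated for c >= 1")
    return c * math.log(c) + 1 - c


def chernoff_bound(c, expected_value):
    """Upper bound e^(-z E[T]) on Pr[T >= c E[T]]."""
    return math.exp(-chernoff_exponent(c) * expected_value)


z = chernoff_exponent(2)
bound = chernoff_bound(2, 100)
print(z, bound)
```

Using the exact z rather than the rounded 0.38 gives a slightly smaller number than e to the minus 38, so the lecture's figure is a valid, marginally looser bound. At c equal to 1 the exponent is 0 and the bound is the uninformative value 1.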

So that's just way better than any results you get with Markov or Chebyshev. So if you have a bunch of random variables between 0 and 1, and they're mutually independent, you add them up. If you expect 100 as the answer, the chance of getting 200 or more-- forget about it, not going to happen.

Now, of course Chernoff doesn't apply to all distributions. It has to be this type. This is a pretty broad class. In fact, it contains the class of all Bernoulli distributions. So I have binomial distributions. Because remember a binomial distribution-- well, remember binomial distributions? That's where T is the sum of Ti's.

In binomial, you have Tj is 0 or 1. It can't be in between. And with binomial, all Tj's have the same distribution. With Chernoff, they can all be different. So Chernoff is much broader than binomial. The individual guys here can have different distributions and attain values anywhere between 0 and 1, as opposed to just one or the other. Any questions about this theorem and what it says in the terms there?

One nice thing about it is the number of random variables doesn't even show up in the answer here. n doesn't even appear. Yeah.


PROFESSOR: Does not apply to what?


PROFESSOR: Yeah, when c equals 1, what happens is z is 0. Because I have a log of 1 is 0, and 1 minus 1 is 0. And if z is 0, it says your probability is upper bounded by 1. Well, not too interesting, because any probability is upper bounded by 1. So it doesn't give you any information when c is 1, none at all. But as soon as c starts being bigger than 1, which is sort of a case you're interested in, you're bigger than your expectation, then it gives very powerful results. Yeah.


PROFESSOR: Yeah, you can. It's true for n equals 1 as well. Now, it doesn't give you a lot of information. Because if c is bigger than 1 and n was 1, so it's using one variable, what's the probability that a random variable exceeds its expectation, c times its expectation?


PROFESSOR: Yeah, let's see now. Maybe it does give you information. Because the random variable has a distribution on 0, 1. That's right, so it does give you some information. But I don't think it gives you a lot. I have to think about that. What happens when there's just one guy?

Because the same thing is true. It's just now for a single random variable on 0, 1 the chance that you're twice the expected value. I have to think about that. That's a good question. Does it do anything interesting there? OK, all right, so let's do an example of how you might apply this.

Say that you're playing Pick 4, and 10 million people are playing. And say in this version of Pick 4, you're picking a four digit number, four single digits. And you win if you get an exact match. So the probability of winning, a person winning, well, they've got to get all four digits right. That's 1 in 10,000, 10 to the fourth.

What's the expected number of winners? If I got 10 million people, what's the expected number of winners? What is it? We've got 10 million over 10,000, right? Because what I'm doing here is the number of winners, T, is going to be the sum of 10 million indicator variables. And the probability that any one of these guys wins is 1 in 10,000. So the expected number of winners is 1 in 10,000 added 10 million times, which is this.
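
The expected number of winners follows from linearity of expectation over the indicator variables. A tiny sketch with variable names chosen here for illustration:

```python
from fractions import Fraction

players = 10_000_000
p_win = Fraction(1, 10 ** 4)  # all four digits must match

# Each player's indicator variable has expectation p_win, and expectations add,
# with or without independence.
expected_winners = players * p_win
print(expected_winners)
```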

Is that OK? Everybody should be really familiar with how to whip these things out. The final will for sure have at least a couple of questions where you're going to need to be able to do that kind of stuff.

All right, say I want to know the probability of getting at least 2,000 winners, and I want to upper bound that just with the information I've given you. Well, any thoughts about an upper bound?


PROFESSOR: What's that?


PROFESSOR: Yeah, that's a good upper bound. What did you have to assume to get there? e to the minus 380 is a great bound. Because you're going to plug in expected value is 1,000. And we're asking for more than twice the expected value. So it's e to the minus 0.38 times 1,000. And that for sure is-- so you computed this. And that equals e to the minus 380. So that's really small.

But what did you have to assume to apply Chernoff? Mutual independence. Mutual independence of what?


PROFESSOR: The numbers people picked. And we already know, if people are picking numbers, they don't tend to be mutually independent. They tend to gang up. But if you had a computer picking the numbers randomly and mutually independently, then the probability would be at most e to the minus 380 by Chernoff, given mutually independent picks.

Everybody see why we did that? Because it's a probability of twice your expectation. The total number of winners is the sum of 10 million indicator variables. And indicator variables are 0 or 1. So they fit that definition up there. And so we already figured out z is at least 0.38. And you're multiplying by the expected value of 1,000. That's e to the minus 380, so very, very unlikely.
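To make the arithmetic concrete, here is a minimal sketch that evaluates the Chernoff bound used above, e to the minus z times the expected value of T, with z equal to c ln c plus 1 minus c, for the Pick 4 numbers. The function names are made up for illustration, and the bound applies only under the mutual independence assumption just discussed.

```python
import math

def chernoff_exponent(c):
    """z = c*ln(c) + 1 - c, the exponent in the Chernoff bound (c >= 1)."""
    return c * math.log(c) + 1 - c

def chernoff_bound(c, mean):
    """Upper bound on Pr[T >= c * E[T]] for a sum T of mutually independent
    random variables, each taking values in [0, 1]."""
    return math.exp(-chernoff_exponent(c) * mean)

players = 10_000_000
expected_winners = players / 10_000      # linearity of expectation: 1,000

z = chernoff_exponent(2)                 # threshold 2,000 is c = 2 times the mean
bound = chernoff_bound(2, expected_winners)

print(f"E[T] = {expected_winners:.0f}, z = {z:.4f}, bound = {bound:.3e}")
```

The exact exponent is slightly larger than 0.38, so the computed bound comes out a bit smaller than e to the minus 380. Rounding z down, as in the lecture, only weakens the bound, so it stays valid.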

What if they weren't mutually independent? Can you say anything about this, anything at all better than 1, which we know for any probability? Yeah?

AUDIENCE: It's possible that everyone chose the same numbers.

PROFESSOR: Yes, everyone could have chosen the same number. But that number only comes up with a 1 in 10,000 chance. So you can say something.

AUDIENCE: You can use Markov.

PROFESSOR: Use Markov. What does Markov give you? What does Markov give you? 1/2, yeah. Because you've got the expected value, 1,000, divided by the threshold, 2,000, which is 1/2 by Markov. And that holds true without any independence assumption.

Now, there is an enormous difference between 1/2 and e to the minus 380. Independence really makes a huge difference in the bound you can compute. OK, now there's another way we could've gone about this. What kind of distribution does T have in this case? It's binomial. Because it's the sum of indicator random variables, 0, 1's. Each of these is 0, 1. And they're all the same distribution. There's a 1 in 10,000 chance of winning for each one of them. So it's a binomial.

So we could have gone back and used the formulas we had for the binomial distribution, plugged it all in, and we'd have gotten something pretty similar here. But Chernoff is so much easier. Remember that pain we would go through with a binomial distribution, the approximation, Stirling's formula, [INAUDIBLE] whatever, the factorials and stuff? And that's a nightmare.

This was easy. e to the minus 380 was very easy to compute. And really at that point it doesn't matter if it's minus 381 or minus 382 or whatever. Because it's really small. So often, even when you have a binomial distribution, well, Chernoff will apply. And that's a great way to go. Because it gives you good bounds generally.

All right, let's figure out the probability of at least 1,100 winners instead of 2,000. So let's look at the probability of at least 100 extra winners over what we expect out of 10 million. We've got 10 million people. You expect 1,000. We're going to analyze the probability of 1,100. What's c in this case? We're going to use Chernoff.

1.1. So this is 1.1 times 1,000. And that means that z is 1.1 times the natural log of 1.1 plus 1 minus 1.1. And that is close to but at least 0.0048. So this probability is at most, by Chernoff, e to the minus 0.0048 times the expected number of winners, which is 1,000. So that is e to the minus 4.8, which is less than 1%, 1 in 100.

So that's pretty powerful. It says, you've got 10 million people who could win. The chance of even having 100 more than the 1,000 you expect is 1% chance at most-- very, very powerful. It says you really expect to get really close to the mean in this situation.

OK, a lot better-- Markov here gives you, what, 1,000 over 1,100. It says your probability could be 90% or something-- not very useful. Chebyshev won't give you much here either. So if you're in a situation to apply Chernoff, always go there. It gives you the best bounds. Any questions? This of course is why computer scientists use it all the time.
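As a rough check on how the bounds compare for the 1,100-winner threshold, the sketch below computes the Markov bound, the Chernoff bound, and the binomial tail itself. It assumes mutually independent picks, so that T is binomial with n equal to 10 million and p equal to 1 in 10,000. The helper name is made up for illustration, and the tail sum is cut off at 2,000 winners, which makes it a very slight underestimate.

```python
import math

n = 10_000_000          # players
p = 1 / 10_000          # chance that one player wins
mean = 1_000            # expected number of winners, n * p
threshold = 1_100       # "at least 100 extra winners"
c = threshold / mean    # 1.1

# Markov: needs only non-negativity, no independence.
markov = mean / threshold

# Chernoff: needs mutually independent picks.
z = c * math.log(c) + 1 - c
chernoff = math.exp(-z * mean)

def log_binomial_pmf(k):
    """log Pr[T = k] for T ~ Binomial(n, p), via log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

# Binomial tail, truncated at 2,000 winners; the terms beyond that are
# far too small to change the sum at double precision.
exact = sum(math.exp(log_binomial_pmf(k)) for k in range(threshold, 2_001))

print(f"Markov   <= {markov:.3f}")
print(f"Chernoff <= {chernoff:.5f}")
print(f"binomial  = {exact:.5f}")
```

The three numbers should come out ordered, with the binomial tail below the Chernoff bound and the Chernoff bound far below Markov's.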

OK, actually, before I do more examples, let me prove the theorem in a special case to give you a feel for what's involved. The full proof is in the text. I'm going to prove it in the special case where the Tj are 0, 1. So they're indicator random variables.

But they don't have to have the same distribution. So it's still more general than you get with a binomial distribution. All right, so we're going to do a proof of Chernoff for the special case where the Tj are either 0 or 1. So they're indicator variables.

OK, so the first step is going to seem pretty mysterious. But we've been doing something like it all day. I'm trying to compute the probability T is bigger than c times its expectation. Well, what I'm going to do is exponentiate both of these guys and compute the probability that c to the T is at least c to the c times the expected value of T.

Now, this is not the first thing you'd expect to do, probably, if you were trying to prove this. So it's one of those divine insights that you'd make this step. And then I'm going to apply Markov, like we've been doing all day, to this. Now, since T is positive and c is positive, these are equal. And this is never negative.

So now by Markov, this is simply upper bounded by the expected value of that, expected value of c to the T, divided by this. And that's by Markov. So everything we've done today is really Markov in disguise.

Any questions so far? You start looking at this, you go, oh my god, I got the random variable and the exponent. This is looking like a nightmare. What is the expected value of c to the T, and this kind of stuff? But we're going to hack through it. Because it gives you just an amazingly powerful result when you're done.

All right, so we've got to evaluate the expected value of c to the T. And we're going to use the fact that T is the sum of the Tj's. And that means that c to the T equals c to the T1 times c to the T2 times c to the Tn.

The weird thing about this proof is that every step sort of makes it more complicated looking until we get to the end. So it's one of those that's hard to figure out the first time. All right, that means the expected value of c to the T is the expected value of the product of these things. Now I'm going to use the product rule for expectation.

Now, why can I use the product rule? What am I assuming to be able to do that? That they are mutually independent, that the c to the Tj's are mutually independent of each other. And that follows, because the Tj's are mutually independent.

So if a bunch of random variables are mutually independent, then their exponentiations are mutually independent. So this is by product rule for expectation and mutual independence. OK, so now we've got to evaluate the expected value of c to the Tj.

And this is where we're going to make it simpler by assuming that Tj is just a 0, 1 random variable. So the simplification comes in here. So the expected value of Tj-- well, there's two cases. Tj is 1, or it's 0. Because we made this simplification.

If it's 1, I get c to the 1-- ooh, expected value of c to the Tj. Let's get that right. It could be 1, in which case I get a contribution of c to the 1 times the probability Tj equals 1 plus the case at 0. So I get c to the 0 times the probability Tj is 0.

Well, c to the 1 is just c. c to the 0 is 1. And I'm going to rewrite the probability of Tj being 0 as 1 minus the probability Tj is 1. All right, this equals that. And of course the 1 cancels. Now I'm going to collect terms here to get 1 plus c minus 1 times the probability Tj equals 1.

OK, then I'm going to do one more step here. This is 1 plus c minus 1 times the expected value of Tj. Because if I have an indicator random variable, the expected value is the same as the probability that it's 1. Because in the other case it's 0.

And now I'm going to use the trick from last time. Remember 1 plus x is always at most e to the x from last time? None of these steps is obvious why we're doing them. But we're going to do them anyway. So this is at most e to this, c minus 1 expected value of Tj. Because 1 plus anything is at most the exponential of that.

And I'm doing this step because I got a product of these guys. And I want to put them in the exponent so I can then sum them so it gets easy.

OK, now we just plug this back in here. So that means that the expected value of c to the T is the product of the expected values of c to the Tj, and each of those is at most this-- e to the c minus 1 times the expected value of Tj. And now I can convert this to a sum in the exponent. And this is j equals 1 to n.

And what do I do to simplify that? Linearity of expectation. c minus 1 times the sum j equals 1 to n expected value of Tj. Ooh, let's see, did I? Actually, I used linearity coming out. I already used linearity. I screwed up here.

So here I used the linearity when I took the sum up here inside the expectation. I've already used linearity. What is the sum of the Tj's? T-- yeah, that's what I needed to do here.
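The chain of steps for the expected value of c to the T can be checked numerically on a small case. The sketch below is only an illustration: the value of c and the probabilities are arbitrary choices, not numbers from the lecture. It computes the expectation by brute force over all outcomes of four independent indicators, compares that with the product form, and then with the exponential upper bound.

```python
import itertools
import math

c = 2.0
probs = [0.1, 0.5, 0.25, 0.9]     # Pr[Tj = 1]; deliberately not all equal

# E[c^T] by brute force over all 2^n outcomes of the independent indicators.
brute_force = 0.0
for outcome in itertools.product([0, 1], repeat=len(probs)):
    pr = math.prod(p if t == 1 else 1 - p for p, t in zip(probs, outcome))
    brute_force += pr * c ** sum(outcome)

# Product rule: E[c^T] = prod_j E[c^Tj] = prod_j (1 + (c - 1) * pj).
product_form = math.prod(1 + (c - 1) * p for p in probs)

# 1 + x <= e^x turns the product into e^((c - 1) * E[T]).
expected_T = sum(probs)
upper_bound = math.exp((c - 1) * expected_T)

print(brute_force, product_form, upper_bound)
```

The first two numbers agree up to floating-point rounding, and the third is larger, which is the inequality the proof needs.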

OK, we're now almost done. We've got now an upper bound on the expected value of c to the T. And it is this. And we just plug that in back up here. So now this is at most e to the c minus 1 expected value of T over c to the c times the expected value of T. And now I just do manipulation. c to something is the same as e to the log of c times that something. So this is e to the minus c ln c expected value of T plus that.

And then I'm running out of room. That equals-- I can just pull out the expected values of T. I get e to the minus c log of c plus c minus 1 expected value of T. And that's e to the minus z expected value of T.

All right, so that's a marathon proof. It's the worst proof I think. Well, maybe minimum spanning tree was worse. But this is one of the worst proofs we've seen this year. But I wanted to show it to you. Because it's one of the most important results that we cover, certainly in probability, that can be very useful in practice. And it gives you some feel for, hey, this wasn't so obvious to do it the first time, and also some of the techniques that are used, which is really Markov's theorem. Any questions? Yeah.

AUDIENCE: Over there, you define z as 1 minus c.

PROFESSOR: Did I do it wrong?

AUDIENCE: c natural log of c, 1 minus c. Maybe it's plus c?

PROFESSOR: Oh, I've got to change the sign. Because I pulled a negative out in front. So it's got to be negative c minus 1, which means negative c plus 1. Yeah, good. Yeah, this was OK. I just made the mistake going to there. Any other questions?
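Written out in one place, the special-case proof reads as follows. This is a reconstruction from the spoken steps, not a copy of the board; it assumes c is bigger than 1 (so that exponentiating preserves the inequality) and that the Tj are mutually independent 0, 1 variables.

```latex
\begin{align*}
\Pr\bigl[T \ge c\,\mathrm{E}[T]\bigr]
  &= \Pr\bigl[c^{T} \ge c^{\,c\,\mathrm{E}[T]}\bigr]
     && \text{($c > 1$)} \\
  &\le \frac{\mathrm{E}[c^{T}]}{c^{\,c\,\mathrm{E}[T]}}
     && \text{(Markov)} \\
  &= \frac{\prod_{j=1}^{n} \mathrm{E}[c^{T_j}]}{c^{\,c\,\mathrm{E}[T]}}
     && \text{(product rule, mutual independence)} \\
  &= \frac{\prod_{j=1}^{n} \bigl(1 + (c-1)\,\mathrm{E}[T_j]\bigr)}{c^{\,c\,\mathrm{E}[T]}}
     && \text{($T_j \in \{0,1\}$)} \\
  &\le \frac{\prod_{j=1}^{n} e^{(c-1)\,\mathrm{E}[T_j]}}{c^{\,c\,\mathrm{E}[T]}}
     && \text{($1 + x \le e^{x}$)} \\
  &= \frac{e^{(c-1)\,\mathrm{E}[T]}}{c^{\,c\,\mathrm{E}[T]}}
     && \text{($\textstyle\sum_j \mathrm{E}[T_j] = \mathrm{E}[T]$)} \\
  &= e^{(c-1)\,\mathrm{E}[T] \,-\, c \ln c\,\mathrm{E}[T]} \\
  &= e^{-(c \ln c + 1 - c)\,\mathrm{E}[T]}
   = e^{-z\,\mathrm{E}[T]}.
\end{align*}
```

The last line is where the sign question comes up: pulling the minus sign out in front turns c minus 1 into 1 minus c, which matches z equal to c ln c plus 1 minus c.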

OK, so the common theme here in using Markov to get Chebyshev, to get Chernoff, to get the Markov extensions, is always the same. And let me show you what that theme is. Because you can use it to get even other results.

When we're trying to figure out the probability that T is at least c times its expected value, or actually even in general, even more generally than that, the probability that A is bigger than B, even more generally, well, that's the same as the probability that f of A is bigger than f of B as long as f is increasing, so the inequality doesn't change direction. And then by Markov, this is at most the expected value of that as long as it's non-negative over that.

In Chebyshev, what function f did we use for Chebyshev in deriving Chebyshev's theorem? What was f doing in Chebyshev? Actually I probably just erased it. What operation were we doing with Chebyshev?

AUDIENCE: Variance.

PROFESSOR: Variance. And that meant we were squaring it. So the technique used to prove Chebyshev was f was the square function. For Chernoff, f is the exponentiation function, which turns out to be-- in fact, when we did it for Chernoff, that's the optimal choice of functions to get good bounds.
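The common theme can be summarized in symbols. The block below is a sketch of that pattern, assuming f is increasing on the range of values involved, f(A) is non-negative, and f(B) is positive; the three instances are the theorems named in the lecture.

```latex
\[
\Pr[A \ge B] \;=\; \Pr\bigl[f(A) \ge f(B)\bigr] \;\le\; \frac{\mathrm{E}[f(A)]}{f(B)}
\]
\begin{align*}
\text{Markov } (f(u) = u):\quad
  & \Pr[R \ge x] \le \frac{\mathrm{E}[R]}{x} \\
\text{Chebyshev } (f(u) = u^{2} \text{ applied to } |R - \mathrm{E}[R]|):\quad
  & \Pr\bigl[\,|R - \mathrm{E}[R]| \ge x\,\bigr] \le \frac{\mathrm{Var}[R]}{x^{2}} \\
\text{Chernoff } (f(u) = c^{u},\ c > 1):\quad
  & \Pr\bigl[T \ge c\,\mathrm{E}[T]\bigr] \le \frac{\mathrm{E}[c^{T}]}{c^{\,c\,\mathrm{E}[T]}}
\end{align*}
```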

All right, any questions on that? All right, let's do one more example here with numbers. And this is a load balancing application, for example, one you might have with web servers. Say you've got to build a load balancing device, and it's got to balance N jobs, B1, B2, to BN, across a set of M servers, S1, S2, to SM.

And say you're doing this for a decent sized website. So maybe N is 100,000. You get 100,000 requests a minute. And say you've got 10 servers to handle those requests. And say Bj is the j-th job, and the j-th request takes Lj time. And the time is the same on any server. The servers are all equivalent.

And let's assume it's normalized so that Lj is between 0 and 1. Maybe the worst job takes a second to do, let's say. And say that if you sum up the length of all the jobs, you get L. Total workload is the sum of all of them. j equals 1 to N.

And we're going to assume that the average job length is 1/4 second. So we're going to assume that the total amount of work is 25,000 seconds, say. So the average job length is 1/4 second. And the job is to assign these tasks to the 10 servers so that hopefully every server is doing L/M work, which would be 25,000/10, or 2,500 seconds of work.

Because when you're doing load balancing, you want to take your load and spread it evenly and equally among all the servers. Any questions about the problem? You've got a bunch of jobs, a bunch of servers. You want to assign the jobs to the servers to balance the load. Well, what is the simplest algorithm you could think of to do this?


PROFESSOR: That's a good algorithm to do this. In practice, the first thing people do is, well, take the first N/M jobs, put them on server one, the next N/M on server two. Or they'll use something called round robin-- first job goes here, second here, third here, 10th here, back and start over. And they hope that it will balance the load. But it might well not. Because maybe every 10th job is a big one.

So what's much better to do in practice is to assign them randomly. So a job comes in. You don't even pay attention to how hard it is, how much time you think it'll take. You might not even know before you start the job how long it's going to take to complete. Give it to a random server. Don't even look at how much work that server has. Just give it to a random one.

And it turns out this does very, very well. Without knowing anything, just that simple approach does great in practice. And today, state of the art load balancers do this. We've been doing randomized kinds of things like this at Akamai now for a decade. And it's just stunning how well it works. And so let's see why that is.
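A small simulation makes the contrast between round robin and random assignment visible. The workload here is a hypothetical worst case for round robin, not data from the lecture: every 10th job takes a full second and the rest take 1/6 second, so the average job length is still 1/4 second. The helper name and the random seed are arbitrary.

```python
import random

M = 10                                   # servers
N = 100_000                              # jobs
# Hypothetical adversarial workload: every 10th job is a big one.
lengths = [1.0 if j % 10 == 0 else 1 / 6 for j in range(N)]

def max_load_ratio(assignment):
    """Load on the busiest server divided by the ideal load L/M."""
    loads = [0.0] * M
    for length, server in zip(lengths, assignment):
        loads[server] += length
    return max(loads) / (sum(lengths) / M)

# Round robin: job j goes to server j mod M, so server 0 gets every big job.
round_robin = [j % M for j in range(N)]

# Random: each job goes to a uniformly random server, ignoring its length.
rng = random.Random(0)
random_assignment = [rng.randrange(M) for _ in range(N)]

print("round robin:", max_load_ratio(round_robin))
print("random:     ", max_load_ratio(random_assignment))
```

On this workload round robin puts all the one-second jobs on the same server, which ends up with about four times its share, while the random assignment keeps the busiest server close to L/M.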

Of course we're going to use the Chernoff bound to do it. So let's let Rij be the load on server Si from job Bj. Now, if Bj is not assigned to Si, it's zero load. Because it's not even doing the work there. So we know that Rij equals the load of Bj if it's assigned to Si. And that happens with probability 1/M. The job picks one of the M servers at random. And otherwise, the load is 0. Because it's not assigned to that server. And that is probability 1 minus 1/M.

Now let's look at how much load gets assigned by this random algorithm to server i. So we'll let Ri be the sum of all the load assigned to server i. So we've got something like indicators, except the random variables are not 0, 1. They're 0 and whatever this load happens to be for the j-th job, at most 1. And we sum up the value for the contribution to Si over all the jobs.

So now we compute the expected value of Ri, the expected load on the i-th server. So the expected load on the i-th server is-- well, we use linearity of expectation. And the expected value of Rij-- well, 0 or Lj. It's Lj with probability 1/M. This is just now the sum of Lj over M. And the sum of Lj is just L.

So the expected load of the i-th server is the total load divided by the number of servers, which is perfect. It's optimal-- can't do better than that. It makes sense. If you assign all the jobs randomly, every server is expecting to get 1/M of the total load.

Now we want to know the probability it deviates from that, that you have too much load on the i-th server. All right, so the probability that the i-th server has c times the optimal load is at most, by Chernoff, if the jobs are independent, e to the minus zL over M, e to the minus z times the expected load, where z is c ln c plus 1 minus c. This is Chernoff now, just straight from the formula of Chernoff, as long as these loads are mutually independent.

All right, so we know that when c gets to be-- I don't know, you pick 10% above optimal, c equals 1.1, well, we know that this is going to be a very small number. L/M is 2,500. And z, in this case, we found was 0.0048. So we get e to the minus 0.0048 times 2,500. And that is really tiny. That's less than 1 in 160,000.

So Chernoff tells us the probability that any server, a particular server, gets 10% load more than you expect is minuscule. Now, we're not quite done. That tells us the probability the first server gets 10% too much load or the probability the second server got 10% too much load, and so forth.

But what we really care about is the worst server. If all of them are good except for one, you're still in trouble. Because the one ruined your day. Because it didn't get the work done. So what do you do to bound the probability that any of the servers got too much load, any of the 10?

So what I really want to know is the probability that the worst server of M takes more than cL over M. Well, that's the probability that the first one has more than cL over M union the second one has more than cL over M union the M-th one. What do I do to get that probability, the probability of a union of events, upper bounded?


PROFESSOR: Upper bounded by the sum of the individual guys. It's the sum i equals 1 to M probability Ri greater than or equal to cL over M. And so that, each of these is at most 1 in 160,000. This is at most M/160,000. And that is equal to 1 in 16,000.
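The two-step argument, Chernoff for one server and then the union bound over all M servers, can be evaluated directly with the lecture's numbers. This is a minimal sketch; the variable names are made up, and it uses the exact value of z rather than the rounded 0.0048, so the results come out slightly smaller than the figures quoted.

```python
import math

N = 100_000                 # jobs
M = 10                      # servers
L = 0.25 * N                # total work: average job length 1/4 second
expected_load = L / M       # 2,500 seconds per server

c = 1.1                     # "10% above optimal"
z = c * math.log(c) + 1 - c

# Chernoff bound for one particular server (assumes independent assignments
# and job lengths between 0 and 1).
per_server = math.exp(-z * expected_load)

# Union bound: Pr[some server is overloaded] <= sum of the per-server bounds.
any_server = min(1.0, M * per_server)

print(f"one server:  at most 1 in {1 / per_server:,.0f}")
print(f"any server:  at most 1 in {1 / any_server:,.0f}")
```

The union bound needs no independence between servers, which matters here because the loads on different servers are not independent: a job that lands on one server cannot land on another.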

All right, so now we have the answer. The chance that any server got 10% too much load or more is 1 in 16,000 at most, which is why randomized load balancing is used a lot in practice. Now tomorrow, you're going to do a real world example where people use this kind of analysis, and it led to utter disaster. And the reason was that the components they were looking at were not independent.

And the example has to do with the subprime mortgage disaster. And I don't have time today to go through it all. But it's in the text, and you'll see it tomorrow. But basically what happened is that they took a whole bunch of loans, subprime loans, put them into these things called bonds, and then did an analysis about how many failures they'd expect to have. And they assumed the loans were all mutually independent.

And they applied their Chernoff bounds. And they concluded that the chances of being off from the expectation were nil, like e to the minus 380. In reality, the loans were highly dependent. When one failed, a lot tended to fail. And that led to disaster. And you'll go through some of the math on that tomorrow in recitation.