**Description:** Covers expected value as it relates to random variables, discussing coin games, network latency, and the hat check problem.

**Speaker:** Tom Leighton

Lecture 22: Expectation I

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, let's get started. Last week we talked about random variables. And this week we're going to talk about their expected value. The expected value of a random variable comes up in all sorts of applications. We're going to spend a whole week talking about it and some of next week talking about variations from the expected value. And it's probably one of the best tools you have for working with probability problems in practice.

So let's write down the definition. The expected value also has other names. It's also known as the average or the mean of a random variable. The expected value of a random variable R over a probability space S is denoted in a lot of ways. We're going to use Ex to denote it, Ex of R.

And it's the sum over all possible outcomes in the sample space of the value of the random variable on that outcome times the probability of that outcome. In other words, the expected value of a random variable is just a weighted average of all possible values of the random variable where the weight is the probability of that happening.

For example, suppose we roll a fair six sided die. So the numbers that come up are 1 to 6. And we let R be the random variable denoting the outcome, 1 through 6. So the expected value of the roll, we can easily compute from the definition. Well, there's a 1/6 chance that it comes out a 1. The next outcome would be a 2. That happens with 1/6 chance. 3, 4, 5, and 6 all happen with a 1/6 chance. If I sum up 1, 2, 3, 4 up to 6, I get 6 times 7 over 2, and that times the 1/6 is just 7 over 2, or 3 and 1/2.

So expected value when I roll a die, a fair six sided die, is just 3 and 1/2. And as you can see from this example, the expected value doesn't have to be attainable by one of the outcomes. You don't get a 3 and 1/2 when you roll a die. But that's the average value.
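The die calculation can be checked directly from the definition. This is a minimal sketch, not part of the lecture: the function name `expected_value` and the dictionary representation (each outcome mapped to its probability) are assumptions made for illustration, and it relies on the outcome itself being the value of the random variable, as it is for the die.

```python
from fractions import Fraction

def expected_value(dist):
    """Weighted average: sum of value * probability over all outcomes."""
    return sum(value * prob for value, prob in dist.items())

# Fair six-sided die: each face 1..6 comes up with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))  # 7/2
```

Exact fractions are used instead of floats so the result is 7/2 rather than a rounded decimal.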

Now the expected value, the expectation, the average, the mean, they're all the same thing. They're all defined based on a random variable. They all mean the same thing. They are all different than the median. That's something that's different. So let me define the median.

Basically it is the outcome which splits the probabilities in half. In other words, you've got a 50% chance of being bigger than the median and a 50% chance of being smaller. Now precisely you can define it this way. The median of a random variable R is the value in the range of R such that the probability that the random variable is less than this median point is at most 1/2. And the probability that the random variable is bigger than the median is strictly less than 1/2.

Now, some texts do it differently. They swap this less than or equal to with this one. And it could give you a different answer if you do that. And I think in the text, actually, we screwed it up. So we have to get that corrected. I think we might have put a less than or equal here, which is not the right definition. So this is the right definition that we'll use.

So using this definition, what is the median of the random variable that corresponds to rolling a die? What's the median when I roll a six sided fair die? Not 3 and 1/2, because it's got to be one of the values. The median does have to be in the range of the random variable. It has to be one of the attainable values. What's the median value when I roll a die?

AUDIENCE: Four.

PROFESSOR: Four. Let's try that out. If I plug in 4 here, the probability I'm less than 4 is 1/2. Could be 1, 2, or 3. The probability I'm greater than 4, which means 5 or 6, is 1/3. That's less than 1/2. So 4 works. 3 doesn't work, because the probability I'm bigger than 3 is 1/2. So it doesn't work. And with the other definition that people sometimes use, it would turn out that 3 is the median.
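The two competing median definitions can be compared on the die with a short sketch. The function name `medians` and the `swapped` flag are hypothetical names introduced here; the default follows the lecture's definition (less-than probability at most 1/2, greater-than probability strictly less than 1/2), and `swapped=True` moves the strict inequality to the other side.

```python
from fractions import Fraction

def medians(dist, swapped=False):
    """Return every value m in the range that satisfies the median definition.

    Default (lecture's definition): Pr(R < m) <= 1/2 and Pr(R > m) < 1/2.
    swapped=True (alternative):     Pr(R < m) < 1/2  and Pr(R > m) <= 1/2.
    """
    half = Fraction(1, 2)
    found = []
    for m in sorted(dist):
        below = sum(p for v, p in dist.items() if v < m)
        above = sum(p for v, p in dist.items() if v > m)
        if swapped:
            ok = below < half and above <= half
        else:
            ok = below <= half and above < half
        if ok:
            found.append(m)
    return found

die = {face: Fraction(1, 6) for face in range(1, 7)}

print(medians(die))                # [4]
print(medians(die, swapped=True))  # [3]
```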

Now, we're not going to spend any more time talking about the median. What's really important in probability is the mean. And that's because you can do a whole lot more with it in applications. Any question about the definitions so far?

All right, so now we're going to do a more interesting example. We're going to play a gambling game and we're going to analyze the expected winnings, the expected return. And this is a simple three person game that you can see played in bars and informally. But to play it I need a volunteer from the class, somebody who wants to play. You've done it. Who hasn't done it? Have you done it? You haven't played. Do you have money in your pocket? You got to borrow some money. If you don't have any money, you got to borrow some.

AUDIENCE: How much money?

PROFESSOR: Oh, I don't know, $5 or $10. Got it? All right, come on down. You're going to play against a couple of TAs. Maybe a couple of you will want to play. I'll get Nick and Martyna to play here.

Now, this is a very simple game. In each round, each of the players is going to wager $2. So I can loan you guys some money here, I guess. So you have $2? All right, I have to loan you money. I better count this out.

AUDIENCE: I think I can find the Zimbabwe dollars I had here somewhere.

PROFESSOR: I don't know if I'm taking those. Here, you can split that up. I loaned you guys $10 there. You got $2. Put it on the table. Wager your $2. Now what we're going to do is they're each going to guess the results of a coin toss. We'll put the pot together here, mix it up. You only have to do two. You can save those for now. Oh, that's your two.

All right, so we have $6 in the pot. Now, each of you is going to guess the outcome of a coin. So here's the guesses. Heads or tails, that's for you. Heads or tails, the guess for you. And you get a heads or a tails. So instead of writing it down, you're going to sort of hide those, maybe get some advice from the class, and you're going to pick one and put it on the table here as your guess.

And then one of you out there is going to flip a coin and announce the result. And then we're going to reveal their guesses and the winners split the money. And if there's no winners, well, they take their $2 back and they split the pot.

So this is a very fair game. They make their choices, you guys toss the coin, and the winners share the pot. If there's one winner, they get $6, which is a profit of $4. Two winners, they each get three, profit of one, the other one's out two bucks. Nobody wins, they get their money back, profit 0. They all win, they get their money back, profit 0.

Is the game clear? Who's got a coin out there? You've got a coin? Very good. Now you've got to guess the coin. Don't show each other your guesses. And put your guesses on the table here. You get to pick heads or tails. Put it on the table.

[LAUGHS]

This is not a hard game. You got a heads and a tails, right? Well, yeah, did you put a heads or a tails in there? Yeah, there's one heads, one tails. No, that's blank. You don't win that way. There you go. That's the tails.

AUDIENCE: I thought you meant--

PROFESSOR: Heads. So if you want heads, you go like that.

AUDIENCE: Yeah. I thought you meant do it the other way.

PROFESSOR: There you go.

AUDIENCE: All right, I'll have to make these indistinguishable now.

PROFESSOR: Well, yeah. One of them is all ripped up now. They already guessed, so they've already done their guess. So you pick one and put it there.

AUDIENCE: This is what I was originally going to guess anyway.

PROFESSOR: So that's good. And you now flip the coin and tell us what it came out.

AUDIENCE: Heads.

PROFESSOR: Heads. All right, let's reveal your choices. All right, so there's one heads. Martyna takes all the money. Just bad luck there, right? We're going to play again. We're going to play a few times. But let's start recording what happened here. So the coin came out heads. OK, remind me your name?

AUDIENCE: Adam.

PROFESSOR: Adam guessed tails. And Martyna. Is it Y? Martyna guessed heads and Nick, he also loses at tails. So in this case Adam is out $2. Martyna got a profit of $4. And Nick is also down $2.

All right, let's try it again. So again, pick another heads or tails. Adam's busily working out probabilities here. Well you got to put your money on the table here now. You got $2. That's three. Here you go.

[LAUGHS]

There we go. Two more dollars. You can make change if you want. There we go. All right, make your choices. You got one? As long as you guys don't show him your choices, we should be good here, I think. And you got to pick one here. Very good. Can we have a coin toss please?

AUDIENCE: Heads.

PROFESSOR: Heads again. What do we got? All right, good job. OK, Adam got heads and Martyna's on a roll here. So we got to split this up. Can you make change for him here? He gets $3. So it was heads. Adam had heads, Martyna has heads, and Nick's in trouble. Nick's off another $2 of my money. Martyna is up $1 because she wagered two and got three and Adam is up $1 here for that one.

All right, let's make another selection. Heads or tails? Put the money up here, $2. Going to need another loan here, Nick?

AUDIENCE: I can throw in my keys.

PROFESSOR: No, no.

AUDIENCE: The keys to my non-existent car.

PROFESSOR: You got money? OK. All right.

AUDIENCE: I got a lot of money.

PROFESSOR: Oh, very good, that's good. That's what we like to hear here.

AUDIENCE: That's what he likes to hear.

PROFESSOR: Everybody make a choice. Heads or tails? Now, you all want to be thinking, OK, what's going on here? We're going to figure out expectations in a minute, do the tree method. Is there any catch going on? Not that we would have a catch here.

OK, put a heads or tails down. OK, coin toss please.

AUDIENCE: Tails.

PROFESSOR: Tails. What do we got? A winner, a winner, and a loser. So we got tails, tails, tails. Nick has lost another $2. Martyna's cruising and Adam is back to even. Good job.

AUDIENCE: Can I cash out now?

PROFESSOR: Well, you're only even. You've got to get ahead here. All right, let's do it again. $2. Everybody got $2 on the table? I'm going to have to loan. You got money? You got it. OK. See, my hope is Nick loses track. OK, make a choice please. Coin toss, please.

AUDIENCE: Tails.

PROFESSOR: Tails. What do we got? Heads. Martyna, Nick! Oh, for goodness sake.

AUDIENCE: She's psychic, I tell you.

PROFESSOR: Martyna wins again. You didn't make a deal with him, did you?

AUDIENCE: [INAUDIBLE].

PROFESSOR: Nick and Martyna are perfect and Adam is down again. Let's collect your money. She gets it all. Oh, she got four here. Yeah, she made four. Good point. Because she won it all. Very good. All right, let's put the next wager up. Nick, if I were you, I wouldn't do much gambling in life. [LAUGHS] You got any money left Adam?

AUDIENCE: I'm going to have a big win.

PROFESSOR: There it is. Working on the big win. There we go.

AUDIENCE: I'm running out of ones here.

PROFESSOR: Martyna can make change for you.

AUDIENCE: Thanks, Martyna. Do you have change for a 10?

PROFESSOR: All right. There we go. A couple bucks there. And then make your selections. Okey doke. It's a pretty fair game. They put the money in, the winners split the pot. You guys flip the coin after they make their selections. OK, make a selection. Martyna has hers. Adam's ready and Nick is ready. Can we have a coin toss?

AUDIENCE: Tails.

PROFESSOR: Tails. What do we got? Nick! Congratulations. Nick wins big. All right. Tails, heads, heads, tails. Nick, $4. He's climbing back. And Martyna lost for the first time $2. Poor Adam is sinking a little bit here.

All right, one more. So who collects? Nick, do you want to collect this? Start paying off the debts, but save $2 for the last round. This yours? Yeah, there's a last one. All right, two more bucks. Last round. OK, make your selections. Martyna's ready. Nick is ready. OK, Adam, what do you like? Coin toss.

AUDIENCE: Heads.

PROFESSOR: Heads. How'd you do? Oh Adam, tough break. Nick again. Nick is the sole winner. So he collects on the heads. All right, so it's plus $4 for Nick, minus 2 for Martyna, minus 2 for Adam. All right, let's see how they did in total. Oh, poor Adam. Minus $6 here. How did that happen? Martyna is plus $6. Now we know how it happened. And Nick is even. Even. Yeah, so that's just a tough break for Adam. He came up here and lost $6.

AUDIENCE: Interestingly, there's $6 on the table. I think I should take that and run.

[LAUGHS]

PROFESSOR: Yeah, you probably should. But I need to get paid back from Nick here. Now obviously this is a fair game, right? So how many people think that Adam was just unlucky? A few. How many people think we just screwed him out of $6? Yeah, you're right. But let's do the analysis to see what happened here. Meanwhile, I probably should give him his $6.

AUDIENCE: [INAUDIBLE].

PROFESSOR: And you got my money. All right, so I have a gift certificate, if you would like, for playing the game here. You can have this or you can have the pocket protector. I got to say somebody last time turned in this for one of these. The nerd pride pocket protector. So what would you like for your memento?

AUDIENCE: I'm going to take the box where [? David Chena ?] is standing.

PROFESSOR: What?

AUDIENCE: As in [INAUDIBLE].

PROFESSOR: You get one too for tossing the coin. So we'll pass this up. Here you go. Want to pass that up for our coin tosser? OK, very good. Thanks very much. Just leave the money here. Yeah. Leave my money here. Oh, we took some out of your wallet.

AUDIENCE: I got $5 money from you-- $5.

PROFESSOR: $5 from me, yeah. All right, so leave me the $5.

AUDIENCE: This is your $5.

PROFESSOR: Oh, I should give them back their $6. Well, otherwise they could report me, and I'd be in trouble. And it's all on film. That would not be good.

OK, so what we're going to do now is analyze the game. And I'm going to prove to you he was just unlucky, despite the fact that you think I screwed him here. So we're going to analyze Adam's expected winnings playing this game. So let's do that.

So we're going to make the tree, as usual. Now, the first thing we have is Adam's choice. Now, Adam were you doing anything besides sort of randomly picking? Were you trying to psych out the coin?

AUDIENCE: Apparently I wouldn't be able to bribe him from all the way down here.

PROFESSOR: Yeah, and you did lose most of it. We'll say Adam was playing 50-50. It's a reasonable assumption. And then we have Martyna and then we have Nick and their choices. And let's say they're 50-50 as well, because none of them know what coin is going to be tossed. And I'm not going to draw this bottom half of the tree, but it's totally symmetric. It just gets too deep otherwise. And then we have Nick's choices, heads, tails, heads, tails. And they're 50-50.

And then we have the coin. And I'll start drawing it down here. And the coin can be heads or tails. And we're going to assume it was a fair coin toss back there, because it looked like it was flipping a bunch of times. And you don't look like Persi Diaconis, so we're going to assume you didn't learn how to flip heads always. So those are 50-50. A little tight here. Everything is 50-50.

So the probabilities are all 1/16. I got 16 possible outcomes. 2 by 2 by 2 by 2. They're all equally likely. So all the probabilities are 1/16.

And now let's see the winnings. So we'll take Adam's gain. It'll turn out to be negative. Now, in this case if Adam's heads and Martyna and Nick are heads and the coin comes up heads, they all win, they split the pot. So how much is the gain for Adam in that case? 0. Now they all guessed heads, but the coin comes out tails, so they split the pot. What's Adam's gain there? 0, because they just split the pot again. He gets his $2 back. That's 0.

Now we have a case where Nick is tails, Adam and Martyna were heads, and it comes out heads. So Nick is losing here. The pot is split by Adam and Martyna. What does Adam get as a profit? 1. He gets half the pot because he splits with Martyna. That's $3, and he put in 2, so he gets 1, plus 1.

Now it turns out same scenario of guesses, but it comes out tails, so Nick wins everything. What's Adam's status here? Minus 2. He bet $2, he lost it. Then we have heads for Adam, tails for Martyna, heads for Nick. Coin comes up heads, Adam splits with Nick, he gets a net of $1. Same scenario guesses, but now it's tails, so Martyna wins everything, Adam loses $2.

Now we go down to here where it's heads for Adam, tails for Martyna and Nick, and heads comes up, so Adam wins the whole thing. What does he get in this case? Plus 4. He wins the whole pot of 6, minus 2. So he gets 4 net. And then lastly, in this scenario it comes up tails and Martyna and Nick split the pot, Adam loses. Minus 2. And then the same thing is happening down here. Same thing. It's just everything is reversed by symmetry.

Now, they're all equally likely. So what we do to compute the expected gain is take each value times its probability and add them up. So 0 times 1/16 plus 0 times 1/16 plus 1 times 1/16 minus 2 times 1/16 and so forth. So it's easier just to add these up and multiply by 1/16. And when we do, we get 0. We get 0. If I add all these up, I get 0. The same thing for down here. So the expected gain for Adam is 0. And that is a fair game.
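The tree analysis above can be reproduced by enumerating all 16 equally likely outcomes. This is an illustrative sketch under the lecture's assumptions (every guess and the coin are independent and 50-50, each wager is $2, and the stakes are returned when nobody wins); the function name `adam_gain` is a hypothetical helper, not something from the lecture.

```python
from fractions import Fraction
from itertools import product

def adam_gain(adam, martyna, nick, coin):
    """Adam's net gain in dollars for one round with $2 wagers and a $6 pot."""
    winners = [adam, martyna, nick].count(coin)
    if winners in (0, 3):
        return Fraction(0)           # everyone just gets their $2 back
    if adam != coin:
        return Fraction(-2)          # Adam loses his wager
    return Fraction(6, winners) - 2  # Adam's share of the pot minus his wager

# All 16 combinations of (Adam, Martyna, Nick, coin), each with probability 1/16.
outcomes = list(product("HT", repeat=4))
expected = sum(Fraction(1, 16) * adam_gain(*o) for o in outcomes)

print(expected)  # 0
```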

What do you think? Do you think we'd go to all that trouble just to play a fair game? Or do you think we're trying to set him up and take his money? Yeah.

AUDIENCE: Martyna and Nick are alternating, so the branch with the plus 4 goes away.

PROFESSOR: Oh, that's interesting. Look what happened here. Nick and Martyna, well, they're opposite. That's opposite. That's opposite. That's opposite. You don't suppose they planned that, do you? And how could that possibly help that they just happened to guess opposite?

AUDIENCE: One of them always wins.

PROFESSOR: One of them always wins. And what does that mean for poor Adam?

AUDIENCE: He never takes the whole pot.

PROFESSOR: He never takes the whole pot. Well, all right. Let's see if that changed anything here. So if Nick and Martyna are always opposite, that means some of these branches can't occur. So if Martyna's heads, Nick is tails, this is out.

So this is happening with probability 1 now. So these points are at 1/8 each instead of 1/16. And these go to 0 because that branch can't happen. Same thing here. When you go tails to Martyna, Nick can't be tails, has to be heads. So these go to 1/8. These go to 0.

Now, you notice this isn't working out so well for Adam because we're putting more weight here where he's got a net negative compared to an even situation. And here he's gone from a positive situation, that's getting wiped out to a net negative.

So let's compute his expectation now. And I'll get the same contribution down here. I've got 1 minus 2 plus 1 minus 2 is negative 2. I'll get a negative 2 down here times 1/8. In this case, the expected gain for Adam is going to be minus. Let's make sure I do this right. It's going to be 2, for the top and bottom, times 1/8 times these guys: 1 minus 2 plus 1 minus 2. And that sum equals minus 2. So that's minus 1/2.

So in fact, if they come up here and guess differently, who knew? Now the expected gain for Adam is minus $0.50. Every time he plays, he expects to lose $0.50 on his $2 bet. That's a lousy game for Adam to be playing, even though it seemed very fair. You see what's going on here?
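The effect of the collusion can be checked the same way, by keeping only the branches where Martyna and Nick guess opposite. This sketch assumes Adam, Martyna, and the coin are each independent and 50-50 while Nick always takes the opposite of Martyna, which leaves 8 outcomes of probability 1/8 each; `adam_gain` and `opposite` are hypothetical names for illustration.

```python
from fractions import Fraction
from itertools import product

def adam_gain(adam, martyna, nick, coin):
    """Adam's net gain in dollars for one round with $2 wagers and a $6 pot."""
    winners = [adam, martyna, nick].count(coin)
    if winners in (0, 3):
        return Fraction(0)
    if adam != coin:
        return Fraction(-2)
    return Fraction(6, winners) - 2

opposite = {"H": "T", "T": "H"}

# Nick's guess is forced to be the opposite of Martyna's.
colluding = [
    (adam, martyna, opposite[martyna], coin)
    for adam, martyna, coin in product("HT", repeat=3)
]
expected = sum(Fraction(1, 8) * adam_gain(*o) for o in colluding)

print(expected)  # -1/2
```

Because exactly one of Martyna and Nick always wins, Adam can never take the whole pot: he either splits it for a gain of 1 or loses 2, each half the time.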

Now, this kind of trick is used in all sorts of gambling games. You've probably played some of these games and may not have realized that somebody was using this trick against you.

For example, how many of you have been in some kind of sports betting pool? It's March Madness, you're betting on the victors in round one. It's a football pool for the weekend. You're going to guess against the spread. Who's going to win? All right, some of you have done that. All right, now everybody puts $1 into the pool. And the winner is the guy who has the most wins, or games picked right. All the winners split the pot.

Now, what this says, by doing the same kind of analysis we just did, is that if you collude with one or two or three other players in the pool to always pick differently on all the games, it's going to give you an edge. Same reasoning. When they pick differently, it gives them an edge and now your expected return is bigger than 0, at the expense of the other players who just go in and are putting their picks, if we assume each one is 50-50.

In fact, a former professor of statistics here at MIT used this idea, a guy named Herman Chernoff, in the 1980s to beat the state lottery. Now, everyone knows that lotteries are the worst game around because everybody puts the money in, the state takes half, and then the winners split the pot. So it's a horrendous game. Your expected return is half of what you put in. So you expect to lose half your bet, because the state's taking half and splitting the remainder among all the participants.

Now, what Chernoff realized is that people don't bet randomly. They tend to pick the same sets of numbers. It might be a birthday. There's only so many birthdays out there. Might be number of home runs Papi's hit, his batting average, Papelbon's ERA. Who knows? But they pick a relatively small set of numbers that they tend to collapse on.

In fact, if you graph how people tend to pick. And say you're playing pick four, where you pick four numbers. And now you look at the frequency with which the numbers are picked. Very crudely it looks something like this. Every once in a while you get a hot ticket where a lot of people pick that number.

For example, if you're picking four numbers, MIT students might pick 2, 4, 6, 16. They might pick that and so it'd be a big spike for that set of four. Down the street they're probably picking 1, 2, 3, 4. Something like that and there's a spike there.

If you knew this was the histogram of what people were picking and you knew half the pot was going to get split with the winners, what would you pick? Would you pick these? No, because now you're splitting the pot with 100 people. That's no good. You pick down here. And now when you win, you get half the pot. The state always takes half, but you don't have to split it with anybody else. And that means if you're picking down here, your expected return is positive, even with the state taking half. Because so much of the money is piled up in these things.

And so Chernoff proved that in fact, he didn't know which numbers were popular. He didn't know that. So what did he do? If you don't know where the spikes are but you're trying to avoid them and there's not very many spikes, what would you do to avoid them?

Pick random. Because if you pick random, probably you miss the spike. You're down here, with a number nobody would have thought to pick, since it's some random number. And he showed that, in fact, if you pick randomly, your expected gain for the lottery at the time was 7%, or $0.07 on the dollar. So positive even with the state taking half.

Now, shortly after that, you saw the proliferation of these machines that create random numbers. Because the state wanted to balance it out and not have this kind of a scenario. So now a lot of the picks are randomly generated in a lot of the lotteries. These things are not immediately obvious, but become very clear once you know the mathematics behind it. Any questions? Yeah.

AUDIENCE: How would a person hit a random number?

PROFESSOR: Oh, you go to your favorite computer and do a random number generator. Now, that's not perfectly random. People do get into a science of this, where there are certain cosmic rays or whatever hitting the earth, and from the frequency they try to get random numbers out of it, or from certain kinds of clocks and stuff like that, the tiny low order bits.

Actually getting really independent random numbers can be challenging. A lot of the things you do with a computer generating random numbers, they're distributed nicely, but they're not mutually independent. And there's whole texts that go into how to do that for mutual independence. Getting something that's fair for one of them is not too hard. Any questions about that?

There's another example. How many people ever participated in a Super Bowl bet? OK, like maybe you're trying to guess the over under on points scored. And the person who guesses closest to the total number of points wins the pot. And if there's a tie, it's shared.

So in that kind of situation, some people figured out that the average number of points scored in a Super Bowl is, say, 30. And a lot of the guesses then will cluster around 30 points. Now, if you knew this and you knew where a lot of the guesses were, where would you guess? Say this is, I don't know, 40 and this is 20.

You could guess here. That's a good one. Or you could guess here. Not only are you not going to share the pot, which helps you, but in this case you'd actually, because being closest, you'd capture all the scores out here. So that's really good. So these are the best guesses to make for your expected return, assuming everybody is guessing around here because they know that's the median. Yeah?

AUDIENCE: But those scores aren't likely to happen.

PROFESSOR: They're not likely to happen, but it can outweigh splitting the pot with all the people that guessed here. And you get to cover more bases. So you're right, they're less likely to happen. But because so many people guessed in here where it's likely, your expected return is better out here. You may not be likely to win, but your payoff will be very large when you do.

Now, if you're in a bet or a pool with a bunch of 6.042 students and they're all guessing out here, well then you want to go back home here. Then it's better. So it all depends on what that distribution looks like. All right, any more questions about this? Yeah.

AUDIENCE: [INAUDIBLE] if you're going to be playing a whole bunch of times, and if you're only playing once, wouldn't you look at the most likely result?

PROFESSOR: Say that again now?

AUDIENCE: If you're going to be only playing once wouldn't you only be concerned with the result that's most likely to show up?

PROFESSOR: Well, it depends on your strategy. If you want the expected gain to be maximized, it doesn't matter how many times you're playing. And that can be very different than maximizing your probability of winning. If you want to maximize your probability of winning, you're going to go right at the center point here.

Because if you look at the probability distribution and say the probability distribution looks like that, then you want to bet here, because that's the maximum chance of winning. But say so many people bet there, you'd split 30 ways. Well, this divided by 30 is smaller than this divided by 1. And so your expected return will be bigger out here. So two different things. Maximizing the probability of winning and maximizing your expected return. Yeah?
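The difference between the two objectives can be made concrete with a toy calculation. Every number below is made up purely for illustration (a $100 pot, a popular guess that wins with probability 2/5 but is shared 30 ways, and an unpopular guess that wins with probability 1/20 and is not shared); none of it comes from real betting data, and `expected_payout` is a hypothetical helper.

```python
from fractions import Fraction

def expected_payout(win_probability, pot, sharers):
    """Expected payout when a winning guess splits the pot `sharers` ways."""
    return win_probability * Fraction(pot, sharers)

pot = 100  # hypothetical pot size in dollars

# Hypothetical popular guess: the most likely to win, but split 30 ways.
popular = expected_payout(Fraction(2, 5), pot, 30)
# Hypothetical unpopular guess: rarely wins, but there is nobody to split with.
unpopular = expected_payout(Fraction(1, 20), pot, 1)

print(popular, unpopular)  # 4/3 5
```

With these assumed numbers the popular guess maximizes the chance of winning, while the unpopular guess has the larger expected payout.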

AUDIENCE: When you say maximizing, do you just do derivatives?

PROFESSOR: Yeah, then you could do derivatives and all that kind of stuff. That's right. Once you have the curve, you figure it out. Yeah. Yeah.

AUDIENCE: Wouldn't it be better to maximize your expected value versus the probability that you win?

PROFESSOR: Yes. In general, you want to maximize the expected return on the basis that probably over life you're doing lots of things. And that overall puts you in a better state. Now, we're going to talk about this some next time in terms of taking high risk bets with high, high payoffs. That can maximize your expected return, but you have a very decent chance of losing a lot.

And so you might not want to go there. And we'll talk about that next time we talk about variance and actually what your choice is. Because that is a fundamental choice you face. Maximizing chance of winning, maximizing expected return. And of course, tied into that is the risk of losing and what kind of risk you're willing to tolerate.

We're going to do a bunch more examples, but before I do, I want to show you some other equivalent definitions of expectation. So the expected value of a random variable, you can also compute it by summing over all possible values of the random variable. x being in the range of R. x times the probability that R equals x.

And let's see why this is true. It follows from the definition. From the definition, we know the expected value is the sum over all sample points of the value of the random variable on the sample point times its probability. And now we can organize this sum by the value that R takes.

So we're going to split this into a double sum where here we're looking first at x in the range of R. And then here we look at sample points for which R on that point is x. So this sum is equivalent to this one. Here we're just organizing all the sample points for the same value of x.

All right, now in the inner sum, I've only got values for which R of w equals x, so I can just replace this with x. The same as before. Only now I've just put x instead of R of w. Now I can pull the x out since it's a constant independent of that sum.

Now here I'm summing up the probability of all the sample points for which R of the sample point is x. And that is just the probability that R equals x. That's the definition of the probability of that event. And so now my answer is the sum over all values x in the range of R of x times the probability R equals x. And that's what I was trying to prove.
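Written out as a chain of equalities, the regrouping argument looks like the following. The notation is a reconstruction from the spoken proof: w ranges over sample points of S, and range(R) is the set of values R can take.

```latex
\begin{align*}
\operatorname{Ex}(R)
  &= \sum_{w \in S} R(w) \Pr(w)
     && \text{(definition)} \\
  &= \sum_{x \in \operatorname{range}(R)} \; \sum_{w \in S :\, R(w) = x} R(w) \Pr(w)
     && \text{(group sample points by the value of } R\text{)} \\
  &= \sum_{x \in \operatorname{range}(R)} \; \sum_{w \in S :\, R(w) = x} x \Pr(w)
     && \text{(since } R(w) = x \text{ in the inner sum)} \\
  &= \sum_{x \in \operatorname{range}(R)} x \sum_{w \in S :\, R(w) = x} \Pr(w)
     && \text{(}x\text{ is constant in the inner sum)} \\
  &= \sum_{x \in \operatorname{range}(R)} x \Pr(R = x)
     && \text{(definition of } \Pr(R = x)\text{)}
\end{align*}
```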

OK, any questions about that? Now there are some special cases of this when the random variable is on the natural numbers. So the range of R is the natural numbers.

So a corollary. If the random variable has a range on the natural numbers, then another way to compute the expected value of R is to simply sum i equals 1 to infinity i times the probability R equals i. And the proof, well, it's really just saying the same thing here. I'm just summing over the natural numbers. And the case with 0 doesn't matter because I get 0 times the probability of 0. And that adds nothing to the sum. So I just have to sum over the positive integers in that case.

All right, there's another special case of this, which makes it even easier sometimes to compute the expected value. If R is a random variable in the natural numbers, then the expected value of R is simply the sum i equals 0 to infinity of the probability the random variable is bigger than i. Which is the same as summing i equals 1 to infinity, the probability R is bigger than or equal to i. They're the same thing.

For example, the first term here, your probability R bigger than 0. That's the same as saying R is bigger than or equal to 1 and so forth. So these are clearly the same. And the difference here is we have i times probability R equals i. Here we have probability R is bigger than i, with no i out in front. So let's see why that's true.

Well, we're going to work backwards and evaluate that sum. The sum i equal 0 to infinity probability R is bigger than i. Well, that's the probability R is bigger than 0 plus the probability R is bigger than 1 plus probability R is bigger than 2, and so forth out to infinity. Adding those up.

And now I can write this one out. That's the probability R equals 1 plus the probability R equals 2 plus the probability R equals 3 and so forth. Probability R bigger than 1 equals, well, R could be 2, R could be 3, and so forth.

R bigger than 2. Well, R could be 3, 4, and so forth. And so now if I add all these up, well, I get 1 times probability R equals 1. Two of these guys. Three of these guys. And you can see the pattern here. Four of the next and so forth. And so we've shown that this value equals that value, which is by the corollary just the expected value of R. So by the corollary, we know that the expected value of R equals the sum up there. And that's the proof of the theorem.
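As a sanity check, the two formulas can be compared on the fair die from earlier, whose range lies in the natural numbers. This is only a sketch for a finite range: the function names `mean_by_values` and `mean_by_tails` are hypothetical, and the tail sum stops at the largest value because the probability that R is bigger than i is 0 beyond it.

```python
from fractions import Fraction

def mean_by_values(dist):
    """Corollary: sum of i * Pr(R = i)."""
    return sum(i * p for i, p in dist.items())

def mean_by_tails(dist):
    """Theorem: sum over i >= 0 of Pr(R > i), for a finite natural-number range."""
    top = max(dist)
    return sum(
        sum(p for v, p in dist.items() if v > i)  # Pr(R > i)
        for i in range(top)                       # terms with i >= top are 0
    )

die = {face: Fraction(1, 6) for face in range(1, 7)}

print(mean_by_values(die), mean_by_tails(die))  # 7/2 7/2
```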

Sometimes it's easier to use that top expression there. And as a good example, it gives a really easy way to compute the mean time to failure in a system. So let's do that.

Suppose you have a system and it fails with probability p at each step. And let's assume that the failures are mutually independent. So if the system has been going for t steps, it still will fail on step t plus 1 with probability p, no matter what's happened before.

And the question is, what's the expected number of steps before the system fails? How long is it going to live before you get a crash? And we're going to let R be that random variable. It will be the step when failure occurs, first failure.

We want to know what's the expected value of R. Mean time to failure. And we're going to use that top formula up there. Makes it a lot simpler to do that than the other definitions. Now, the probability that R is bigger than i is the same as the probability it did not fail in the first i steps. So this equals the probability of no failure in the first i steps.

Because this event, R bigger than i, means that the first failure was not in the first i steps, so the system was fine for the first i steps. And because of mutual independence on when failure occurs, this is simply the product of the probability that we're OK, no failure, in the first step, times the probability we're OK in the second step, and so forth up to the ith step. We're OK in the ith step.

And this is where we're using mutual independence, because I've gone from a situation where we're OK in the first i steps. The probability of that is the product of the probabilities that we're OK in each step individually. So that needed independence.

Well, what is the probability we're OK in the first step? What's the probability of no failure on step one? 1 minus p, because p is the probability we did fail. So we're OK with probability 1 minus p. What is the probability we're OK in the second step? 1 minus p and so forth for all i steps. So this is 1 minus p to the i.

And it's usually simpler to write that as alpha to the i where alpha equals 1 minus p. Because what we've got here now is the sum. Expected value of R is a sum, i equals 0 to infinity of that probability, which is just alpha to the i. And we all know what that sum is, right? What's that sum? 1 over 1 minus alpha.

And that, plugging back in alpha to be 1 minus p, is 1 over 1 minus the quantity 1 minus p. And that's very simple. That's 1 over p. So the expected time to fail, the expected step of when you're going to fail, is 1 over p where p is the failure probability.
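Written as one chain, the calculation reads as below. This is a sketch of the spoken steps, assuming a failure probability p with 0 < p ≤ 1 so that the geometric series converges.

```latex
\[
\operatorname{Ex}[R]
  = \sum_{i=0}^{\infty} \Pr[R > i]
  = \sum_{i=0}^{\infty} \alpha^{i}
  = \frac{1}{1 - \alpha}
  = \frac{1}{1 - (1 - p)}
  = \frac{1}{p},
  \qquad \alpha = 1 - p .
\]
```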

So for example, if you have a 1% chance of failing on any step, your mean time to failure is what? 100. 1 over 0.01. So very easy to compute mean time to failure. Any questions about that? Of course, you can do it with the other definitions, but the calculations are a little more painful. Yeah.
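As a numerical sanity check on the 1 over p answer, here is a minimal Python sketch that evaluates both formulas with truncated sums. The function names and the `terms` cutoff are illustrative choices, not part of the lecture; the truncation is harmless here because the terms shrink geometrically.

```python
def mean_time_to_failure_tail_sum(p, terms=10_000):
    """Approximate Ex[R] = sum over i >= 0 of Pr[R > i] = (1 - p)**i."""
    alpha = 1.0 - p
    return sum(alpha ** i for i in range(terms))


def mean_time_to_failure_definition(p, terms=10_000):
    """Approximate Ex[R] = sum over i >= 1 of i * Pr[R = i].

    Pr[R = i] = (1 - p)**(i - 1) * p: no failure for i - 1 steps, then a failure.
    """
    alpha = 1.0 - p
    return sum(i * alpha ** (i - 1) * p for i in range(1, terms + 1))


if __name__ == "__main__":
    p = 0.01  # the 1% failure chance from the example
    print(mean_time_to_failure_tail_sum(p))
    print(mean_time_to_failure_definition(p))
    print(1 / p)
```

Both sums should agree with 1 / p up to floating-point error; the tail-sum version is the one that needs no factor of i.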

AUDIENCE: Why are we summing it? Is it like an accumulative solution basically?

PROFESSOR: Why are we summing? Well, that's the definition. The theorem says the expected value of a random variable is the sum of the probability that it's bigger than i. That's what the theorem says. So we're just plugging into the theorem. And then the theorem was proved based on the corollary, which came from the theorem and then the definition.

So we went through a series of steps. We started with a definition of expected value, which makes sense. Then we got another way to compute it based on that definition. And then a corollary to that, and then we used the corollary to prove another way of computing expected values. We went through a bunch of general steps, and basically you could use this as a definition at this point for expected value. And it just says you sum those things up and you get the answer. Any other questions?

OK, there's a variation on this problem that you see all the time in sort of trick questions or in the popular press sometime, that often confuses people. People sometimes think of it as a paradox, though it's not really one. And the example's something like the following.

Say that a couple, they're going to have kids. What they really want is a baby girl. They get a boy, fine, but that's not what they're concerned about. They want to have a baby girl. And let's say that each time they have a kid it's 50-50 boy or girl. And let's say that it's mutually independent from one kid to the next, which is not true in practice. There tends to be correlation. But let's assume it's mutually independent from one kid to the next.

Now, if on the first try a couple get a girl, great, they're done, they have one kid, that's it, because they just wanted a girl. If they get a boy, OK, try again. And they keep trying again until they get the girl, even if it's 50 kids. They wait till they get the girl. And the question is, how many baby boys do you expect to get before you have the girl? What's the expected number of boys before you get the girl and then quit? So let's do that.

So we want to know. We have the following data. The probability of a boy is 1/2. You keep having boys until you get a girl, then you quit. We're going to let R be the random variable for the number of boys. And everything is mutually independent from one child to the next.

How many people think you expect to have more boys than the one girl? You keep having boys till you get the girl. Most people think this. How many people think you expect to have fewer than one boy? Nobody. How many people think you expect to have an equal number? Expect one boy. Good, OK, so that's the answer. In fact, you expect to have just one boy. And the proof, we sort of just did it here.

Now, in this case, we're going to set it up as a mean time to failure. Same kind of thing. Now, in this case the failure mode, they want the girl, but that's when you stop, so that's the failure mode. And a working step is you have a boy and you keep having the working steps until you have failure mode and then you stop. And in this case, we're not counting the step when you stop.

So we know from that that the expected number of children you have is 1 over p, where p is going to be 1/2. That equals the mean time to failure. And the expected value of R is that minus the girl, because you count the girl as one of the children. That's going to be the expected number of boys: the number of children you're expected to have minus the girl. This is 1 over 1/2, minus 1. And that is 2 minus 1 equals 1. So you expect to have one boy before you get the girl. Any questions on that?
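The same answer can be checked directly from the definition, without going through mean time to failure. The sketch below is illustrative only: the function name and the `max_boys` cutoff are assumptions, and the sum is truncated, so it gets very close to the exact value rather than hitting it.

```python
from fractions import Fraction


def expected_boys(p_girl=Fraction(1, 2), max_boys=200):
    """Truncated sum of k * Pr[exactly k boys, then the girl].

    Assumes children are mutually independent, as in the lecture.
    """
    p_boy = 1 - p_girl
    return sum(k * p_boy ** k * p_girl for k in range(max_boys + 1))


if __name__ == "__main__":
    print(float(expected_boys()))                 # 50-50 case from the lecture
    print(float(expected_boys(Fraction(1, 4))))   # hypothetical 25% chance of a girl
```

In both cases the result should match 1 / p minus 1, the expected number of children minus the girl.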

OK, how about this? Some couples want to have at least one of each sex. They want to have at least one boy and one girl. So they keep having children until they get one of each and then they stop. How many children do they expect to have? Somebody said two. That's a minimum number. So it's not likely to be the expected number. Three. Well, OK, who said three? Why do you think three?

AUDIENCE: Because you have to have at least the first two, so it can be greater than one. I mean, one probability is greater than one child, one probability is greater than two children. And after that it's halves which add up to one.

PROFESSOR: Yeah, that's right. That's a good proof. Very good. That'll work. Yeah, yeah.

AUDIENCE: I have a question about what you were doing before. If you were to switch your expectations for a girl and boy, wouldn't it come out to the same number but wouldn't it kind of contradict itself?

PROFESSOR: No, if you stopped as soon as you had a boy, you'd expect to have one girl before you got a boy. Totally symmetric. In this case, we're stopping when we get the girl, so we expect to have one boy. You might have none, you might have one, you might have two. And as he mentions, if you put the probabilities in there, they will work out the right way, but we just did it simpler by the mean time to failure.

And in fact, there's also another proof, what you're doing. Another way to think about how many kids you have to get one of each. Well you have the first kid. And now you have this problem, because you want one of the other sex. And you keep on trying to get one of the other sex and you expect to have two children to get one of the other sex.

So you have one to start. Whatever it is doesn't matter. And now you expect to have two more until you hit the other sex. So a total of three is the expected number of children to get one boy and one girl, at least. Any questions about expectation?
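The audience member's argument uses the tail-sum theorem: the couple certainly has more than zero and more than one child, and after that the probabilities halve. A small sketch of that calculation, with N standing for the total number of children (a name introduced here for illustration) and a truncated sum:

```python
from fractions import Fraction


def prob_more_than(i):
    """Pr[N > i]: after i children the couple still lacks one of each sex."""
    if i == 0:
        return Fraction(1)
    # Either all i children are boys or all i are girls.
    return 2 * Fraction(1, 2) ** i


def expected_children(terms=200):
    """Truncated version of Ex[N] = sum over i >= 0 of Pr[N > i]."""
    return sum(prob_more_than(i) for i in range(terms))


if __name__ == "__main__":
    print(float(expected_children()))
```

The first two terms are each 1, and the remaining terms 1/2, 1/4, ... add up to 1, which gives the total of three.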

OK, let's do another example. This comes up all the time in experimental work. Probably all of you are going to have some example at some point where you're going to do a problem like this. And most all the time, people do it wrong. So let's see an example.

Say that you want to measure the latency of some communications channel. And you want to know what's the average latency. So you set up an experiment. You send a packet through the channel, you measure when it started, when it got to the other end, and you record the answer, and then you do it 100 times and you take the average. And you say that's the expected latency of the channel. Very, very typical kind of thing to do, and sometimes OK to do.

So you'd do something like this. You pick a random variable D, which is going to denote the delay of a packet going through the channel. And there's, of course, some underlying distribution here, which we'll denote by f of x. And that's just the probability that D equals x. That's the probability distribution function.

And as part of doing your stuff, you notice that if I plot f here, with x on that axis and f of x here, it looks something like this. The chance of getting the cases where I got a long delay was small, low probability. And almost all the time I got a short delay. So very typically you'll get a curve that looks something like that in terms of the observations of your experiment.

And say that you do your experiment 100 times and the average latency was 10 milliseconds. And sometimes you want to be really careful, so you do the whole thing again, you do another 100 times. And maybe it's nine milliseconds the next time. And that sort of confirms your belief that your first experiment was valid and that the expected latency on this channel is 10 milliseconds.

Sounds pretty good. But it can be completely wrong. And not just because you're unlucky, but because you're taking a simple method like that. And let me show you an example where it is way off.

All right, say you were just a little more sophisticated about this. And when you did your observations, you tried to figure out what this curve really looks like. And say that you, looking at the data, concluded that the probability that you have a delay of more than i milliseconds is 1 over i. That looks like this curve, as close as anything.

So from your data, you conclude this. That's a little more sophisticated conclusion, because now you've identified what you believe the distribution is, which is stronger than expectation. How would you go about figuring out the expected value if you had that information? I mean, you could average the 100 sample points. But is there any way, if you assume this, what would you do for the expected value?

Yeah, did I erase the theorem? No, it's over there. We just plug it into the theorem. The expected delay is going to be the sum of those probabilities from 1 to infinity. So we would compute from the theorem the expected delay is i equals 1 to infinity probability. Do I want to do the 0 case? Want to be sure I don't get caught up in the 0 case. We'll use the case 1 to infinity. Probability D greater than or equal to i. So let me put greater than or equal to here. That equals the sum i equals 1 to infinity of 1 over i.

What's that? What's the sum of 1 over i, i from 1 to infinity? It's infinite. Remember those harmonic numbers? i going from 1 to n gives you about log n. Remember that, the book stacking thing? i going from 1 to infinity, that's infinity.

So in fact, your expected latency is infinite. And you just published a paper saying it was 10 milliseconds, maybe nine. So it's very dangerous to just take a bunch of the points, add them up, and average them and say that is the answer. Especially if you have some reason to believe the distribution looks like this and it really is something like that. It could be infinite.
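To see how slowly that sum blows up, here is a short sketch that cuts the sum off at n terms, which gives the nth harmonic number, and prints it next to the natural log of n. The function name and the cutoffs are illustrative assumptions.

```python
import math


def truncated_expected_delay(n):
    """Sum of Pr[D >= i] = 1/i for i = 1..n, i.e. the n-th harmonic number."""
    return sum(1.0 / i for i in range(1, n + 1))


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, truncated_expected_delay(n), math.log(n))
```

The partial sums track log n and keep growing with n, so no finite cutoff captures the full expectation.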

Now in some cases, if your distribution is very well behaved, averaging your sample points is the perfect thing to do. But it helps to know it's not necessarily the case that's a good way to go. Any ideas what went wrong? Yeah.

AUDIENCE: I have a question. What if your probability were 1 over i squared?

PROFESSOR: Yeah, what would happen then, if in fact it was 1 over i squared? So you need to do this sum. Here's a good review question for the final. What method do you use to estimate that? Remember how we do that? Is that infinite? No. You use the integration bound.

And you'll see that this is pretty small. It'll be, what, 1 and 1/2, 2, something like that, where we can do the integration method here. So huge difference. This is O of 1. This is bounded. Probably less than 2, if we did the integration method. So huge difference here in what the outcome is.
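The integration bound mentioned here can be sketched as follows. The bound itself is the standard comparison of a decreasing sum with an integral; the exact value quoted at the end is a known result included only for comparison, not something derived in the lecture.

```latex
\[
\sum_{i=1}^{\infty} \frac{1}{i^{2}}
  \;\le\; 1 + \int_{1}^{\infty} \frac{dx}{x^{2}}
  \;=\; 1 + \Bigl[-\frac{1}{x}\Bigr]_{1}^{\infty}
  \;=\; 2,
\qquad\text{and in fact}\qquad
\sum_{i=1}^{\infty} \frac{1}{i^{2}} = \frac{\pi^{2}}{6} \approx 1.645 .
\]
```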

Now, how can it be that I've got something with expected infinite value and I average 100 points and I got 10 milliseconds? Yeah.

AUDIENCE: The infinite value comes from the fact that there is a decent probability that delay is going to be huge. However, if you only take a finite number of sample points, then chances are you're not going to get any monstrous delays.

PROFESSOR: Exactly. The chance of seeing anything beyond 100 milliseconds is 1 in 100. So I'm probably not going to see it. In a sample size of 100, almost surely I won't see something that takes a second, 1,000 milliseconds. But yet it's those rare sample points that are causing that sum to blow up. If I sum that from 1 to 100, I get about log of 100, which is pretty small.
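A small simulation can illustrate the effect. This is a sketch under an assumed model: delays are whole milliseconds with Pr[D >= i] exactly 1 over i, sampled by taking the floor of 1/U for a uniform U. The function names and the seed are arbitrary choices.

```python
import math
import random


def sample_delay(rng):
    """Draw D with Pr[D >= i] = 1/i for every integer i >= 1."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so 1/u never divides by zero
    # floor(1/u) >= i exactly when u <= 1/i, which has probability 1/i.
    return math.floor(1.0 / u)


def sample_mean(rng, n=100):
    """Average of n independent draws, like one 100-packet experiment."""
    return sum(sample_delay(rng) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, only for repeatability
    print([round(sample_mean(rng), 1) for _ in range(10)])
```

Repeated 100-sample averages typically come out modest, with an occasional run dominated by a single very large draw, even though the true expectation under this model is infinite.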

So what's happening when you do your finite sample is you're missing the big guys which are very rare, but enough to blow up your expectation. Now, you can draw two conclusions from that. One of them is the one we just did. The other is, well, expectation's the wrong measure. And really what we should be doing is looking at only 1,000 sample points or something like that.

But in practice of using the thing over and over again, eventually you're going to get hit with a whopper. Sometimes you'll see people take data points out when they're really the big ones. They say, oh, well, that was an anomaly. I take that out and then we compute the average. Yeah?

AUDIENCE: How much would you pay to play a game where the payoff would be the latency of the packet?

PROFESSOR: What's that?

AUDIENCE: How much do you pay to play a game where the payoff would be the latency of the packet?

PROFESSOR: All right. That's a big number. If that was my losses here, that's tough. I'd bet anything up against that. To get a payoff of infinity? You'd pay $1,000 to play that game. Now, you'd want to play it for a long time to get that big payoff, right? But eventually that's what it's going to be. Any questions? People understand the issue here and what to worry about when you all do that someday? Yeah.

AUDIENCE: [INAUDIBLE].

PROFESSOR: Yeah, now here you don't know for a fact this is it. But you could model and you start seeing these. You fill in the points and you say, it looks like this, let's assume that's what it was, then here's the result. If it looks like it's 1 over i squared, you can say, let's assume that's what it is and then you get a different result. But you take the various cases and consider them to do it.

Or you could say, I take the expectation assuming I never get a point bigger than 100. And I limit it that way. And then you're safe at that point and you can get away with it. But to blindly go out there and say here it is, not so reliable.

The expected value does have a lot of useful properties. And the most important is called linearity of expectation. And we'll spend the rest of today and some of next time talking about it. And quite possibly it's one of the reasons people use it so much instead of other things you might think about.

The theorem, and this may be the most important theorem on probability. For any random variables, R1 and R2, on a probability space S, the expected value of R1 plus R2 equals the expected value of R1 plus the expected value of R2. Very simple. It's another way of saying expectation is a linear function.

The proof. No, skip the proof here. It's not hard and it's in the text. Follows pretty simply from the definition. There's a generalization for more than two random variables. So corollary. For all k in the natural numbers and k random variables R1, R2, up to Rk on the same probability space S, the expected value of R1 plus R2 up through Rk is just the sum of the expected values. And the proof of that is by induction using that result.

And the really important thing about this is that neither of these results needs independence. It is true whether or not the Ri are independent. Pretty much everything we do in probability to manipulate random variables needs them to be independent. You don't need that here. And that'll make it very powerful.
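For reference, the skipped proof can be sketched from the definition given at the start of the lecture, the sum over outcomes w of R(w) times Pr[w]. This is a reconstruction, not the text's proof, and it assumes the sums involved converge absolutely so that they can be split.

```latex
\begin{align*}
\operatorname{Ex}[R_1 + R_2]
  &= \sum_{w \in S} \bigl(R_1(w) + R_2(w)\bigr) \Pr[w] \\
  &= \sum_{w \in S} R_1(w) \Pr[w] + \sum_{w \in S} R_2(w) \Pr[w] \\
  &= \operatorname{Ex}[R_1] + \operatorname{Ex}[R_2], \\[4pt]
\operatorname{Ex}[R_1 + R_2 + \cdots + R_k]
  &= \operatorname{Ex}[R_1] + \operatorname{Ex}[R_2] + \cdots + \operatorname{Ex}[R_k].
\end{align*}
```

Nothing in the first three lines uses how R1 and R2 relate to each other, which is why independence is not needed.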

So let's do an example. Say I roll two fair dice. Six sided dice. Not necessarily independent. R1 is the outcome on the first die. And R2 will be the outcome on the second one. And I'm interested in the sum of the dice. So we let R be R1 plus R2. And I want to know the expected value of the sum of the dice when I roll them.

Now, if I didn't use that theorem I'd compute the tree in the sample space. I'd get 36 possible outcomes, take the probability of each. It'd take you a little while to do it, but using linearity of expectation, it's easy. It's expected value of R1 plus the expected value of R2. Each of these we already figured out is 3 and 1/2. And so the answer is 7. So if you roll a pair of dice, whether or not they are independent, the expected sum is seven. Any questions about linearity of expectation? Yeah.
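The claim that independence does not matter can be checked by brute force on small joint distributions. In the sketch below, the two dependent pairings are hypothetical examples made up for illustration; each die is still individually fair in all three cases.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)


def expected_sum(distribution):
    """Expected value of R1 + R2 for a dict mapping (r1, r2) -> probability."""
    return sum((r1 + r2) * pr for (r1, r2), pr in distribution.items())


# Independent dice: all 36 pairs equally likely.
independent = {pair: Fraction(1, 36) for pair in product(FACES, FACES)}

# Hypothetical dependent dice: the second die always copies the first.
copied = {(r, r): Fraction(1, 6) for r in FACES}

# Hypothetical dependent dice: the second die always shows 7 minus the first.
complementary = {(r, 7 - r): Fraction(1, 6) for r in FACES}

if __name__ == "__main__":
    for name, dist in [("independent", independent),
                       ("copied", copied),
                       ("complementary", complementary)]:
        print(name, expected_sum(dist))
```

The distributions of the sum are very different (the complementary pair always sums to exactly 7), but the expected sum is the same in each case.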

AUDIENCE: [INAUDIBLE].

PROFESSOR: This one? [INAUDIBLE].

AUDIENCE: [INAUDIBLE].

PROFESSOR: No, I mean pluses here. Here? I'm computing the sum of the random variables. So this could be a 1, that could be a 10, this could be a 3. So I'm computing the expected value of the sum, just like when I rolled two dice, I'm taking the expected value of the sum of the dice. Any other questions?

Now we're going to do a little trickier problem that uses linearity of expectation. Yeah?

AUDIENCE: [INAUDIBLE] sets, like when we're looking at the failure. Couldn't we add the two cases, like R1 being a girl and R2 being a boy?

PROFESSOR: Well, what would it mean to sum-- so what is the random variable? R1 is the case when you get a boy. So it's an indicator for getting a boy. R2 is the indicator for getting a girl. R1 plus R2 is by definition 1 then, because you got a boy or a girl. One of them had to happen. And the expected value would be one. But that's a different kind of game. But we are going to start using this in sophisticated ways to make calculations be easier for things like that.

This problem we call the hat check problem. And the idea here behind this is say that you have n men at a restaurant having dinner. And when they come into the restaurant, they check their hats in the coat room.

Then something goes wrong in the coat room and the hats are all scrambled up randomly. Let's say a random permutation of the hats. So the men come back to get their hats and they get a random hat coming back. So each man gets a random hat back after dinner.

And the question is, what is the expected number of men to get the right hat back? So we let R be the random variable that says the number of men to get the right hat, their own hat. And we want to know the expected value of R. Now, from the definition, that's just the sum of all possibilities. K from 1 to n. K times the probability R equals K. So using one of the definitions, we could compute the expected value this way. In fact, if you were to be assigned this on a test or on homework, that's probably how you'd start something like that.

And then, well, the next step you'd take would be to figure out what's the probability that exactly K men get the right hat back. In fact, we actually asked this once before we started doing it in class. And it was really hard if you went down this path. Because if you spent all night with your buddies, you would maybe get to the conclusion that probability R equals K is this. 1 over K factorial times n minus K down here for K less than or equal to n minus 2. And 1 over n factorial if K equals n minus 1 or n.

Then you would plug that nasty looking thing into there. So multiply by K. And you'd have to sum it up. And Lord help you. That's just a nightmare to do. You'd have a very hard time getting the answer. If you doubt that, try it. But that would be a natural way to proceed.

But it turns out there is a trivial way to get the answer. And this is a very powerful technique using linearity of expectation. And for sure there will be a problem on the final exam just like this. And so if you go down this path, which is a natural first path, it may take you the rest of the day to solve it. But the method I'm going to show you will take you a couple minutes to do it.

Now, the trick is to use linearity of expectations. So the problem is there's no sum here. So what we need to do is we're going to express R as the sum of random variables. And the way we're going to do that is as follows. And it's not obvious, but once you see it, it's easy to keep using it.

We let R be the sum of R1 plus R2 plus Rn. And Ri is going to tell us the event. This is sort of what you were talking about before with the event of a boy or event of a girl. This is going to be the event of the ith man gets his right hat back. So it's an indicator random variable. It's 1 if the ith man gets the right hat. And it's 0 if he doesn't.

So whenever the ith guy gets his hat back, that counts as one, and now you can see why this sum works. R is the number of men to get the right hat back. And basically there's a one counted in here every time a guy gets his hat back, and 0 if he doesn't. So this sum is counting how many men got their right hat back. Non-obvious the first time, gets really simple the fourth or fifth time. We'll try to do a couple of them today.

All right, now the expected value of R is easy. It's just, by linearity of expectation, the expected value of R1 and so forth plus the expected value of Rn. The expected value of an indicator random variable is just the probability that it's 1. Right? It's 1 times the probability that it's 1 plus 0 times the probability that it's 0. That's just the probability that it's 1.

What's the probability that the first man gets the right hat back? 1 over n. There's n hats. He gets a random one. What's the probability that the second man gets his hat back? Not 1 over n minus 1. He's coming in whether he's first, second, or last. He gets a random hat. 1 in n chance it's his.

What's the probability the last man gets the right hat back? One over n. Doesn't really matter if he's first or last. Just sort of tricks up a little bit. He's getting a random hat back. So it's 1 over n. I've got n of these, each 1 over n, so what's the expected number of men to get the right hat back? One. Now, the math doesn't get much easier than that.
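For small n the answer can be confirmed by brute force, which is essentially the hard path done by machine: enumerate every permutation of hats and average the number of men who get their own. The function name is an illustrative choice, and the approach only scales to small n because there are n factorial permutations.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def expected_correct_hats(n):
    """Average number of fixed points over all n! equally likely permutations."""
    total = sum(
        # Man number `man` receives hat number `hat`; count the matches.
        sum(1 for man, hat in enumerate(perm) if man == hat)
        for perm in permutations(range(n))
    )
    return Fraction(total, factorial(n))


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, expected_correct_hats(n))
```

The linearity argument gives the same value for every n without enumerating anything.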

Now, the amazing thing is we just proved that if we take that mess, stick it in here and sum it up, what answer do we get? 1. That is certainly not obvious, but that is a consequence of everything we've just done. We've just given a probability proof of that fact.

But the nice thing here is this is actually even more powerful. Did I need to assume that it was a random permutation of hats like I would need to assume for this? No. No independence is needed. In fact, there's all sorts of distributions that will give the same expected value. All I need is that each person gets the right hat back with probability 1 over n. In fact, this is an example of a different distribution for which the result is the same.

Say you're at a Chinese restaurant. And you know they have the thing that spins in the middle of the table? Say that you each order an appetizer. There's n people and they're around a big circular table. Everybody gets their appetizer. Wonton soup, whatever.

And then there's always the joker who spins the thing and spins around and then it stops and now you've got a random appetizer in front of you. In this case, we want to know what's the expected number of people to get the right appetizer back. Not the other guy's wonton soup, but yours. That's a different probability space because there's only n sample points, n places where the thing could have stopped. Not n factorial like the hats.

Well, does the analysis change? Exactly the same. Ri is the indicator variable for the ith person gets the right appetizer back. Linearity of expectation. The expected value of the indicator variable is just the probability that it's 1. And the probability that any person gets the right appetizer is 1 over n.

So the answer is the same. The expected number of people to get the right appetizer back is 1. So totally different probability spaces, exactly the same analysis and answer. OK, so we'll do more of this next time.
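The spinning-table version is small enough to enumerate completely, since there are only n possible stopping positions. A minimal sketch, assuming each of the n rotations is equally likely (the function name is made up for illustration):

```python
from fractions import Fraction


def expected_correct_appetizers(n):
    """Average number of matches over the n equally likely rotations."""
    total = 0
    for shift in range(n):
        # Person i ends up facing the dish that started in front of (i + shift) % n.
        total += sum(1 for i in range(n) if (i + shift) % n == i)
    return Fraction(total, n)


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, expected_correct_appetizers(n))
```

Here the outcomes are all or nothing: one rotation gives everyone the right dish and every other rotation gives no one the right dish, yet the expected number of matches is the same as for the hats.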
