# Lecture 7: Multiple Discrete Random Variables: Expectations, Conditioning, Independence

Flash and JavaScript are required for this feature.

Description: In this lecture, the professor discussed multiple random variables, expectations, and binomial distribution.

Instructor: John Tsitsiklis

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, good morning. So today, we're going to have a fairly packed lecture. We are going to conclude with chapter two, discrete random variables. And we will be talking mostly about multiple random variables. And this is also the last lecture as far as quiz one is concerned. So it's going to cover the material until today, and of course the next recitation and tutorial as well.

OK, so we're going to review quickly what we introduced at the end of last lecture, where we talked about the joint PMF of two random variables. We're going to talk about the case of more than two random variables as well. We're going to talk about the familiar concepts of conditioning and independence, but applied to random variables instead of events. We're going to look at the expectations once more, talk about a few properties that they have, and then solve a couple of problems and calculate a few things in somewhat clever ways.

So the first point I want to make is that, to a large extent, whatever is happening in our chapter on discrete random variables is just an exercise in notation. There is stuff and concepts that you are already familiar with-- probabilities, probabilities of two things happening, conditional probabilities. And all that we're doing, to some extent, is rewriting those familiar concepts in new notation. So for example, this is the joint PMF of two random variable. It gives us, for any pair or possible values of those random variables, the probability that that pair occurs simultaneously. So it's the probability that simultaneously x takes that value, and y takes that other value.

And similarly, we have the notion of the conditional PMF, which is just a list of the -- condition of -- the various conditional probabilities of interest, conditional probability that one random variable takes this value given that the other random variable takes that value. Now, a remark about conditional probabilities. Conditional probabilities generally are like ordinary probabilities. You condition on something particular. So here we condition on a particular y. So think of little y as a fixed quantity. And then look at this as a function of x. So given that y, which we condition on, given our new universe, we're considering the various possibilities for x and the probabilities that they have.

Now, the probabilities over all x's, of course, needs to add to 1. So we should have a relation of this kind. So they're just like ordinary probabilities over the different x's in a universe where we are told the value of the random variable y. Now, how are these related? So we call these the marginal, these the joint, these the conditional. And there are some relations between these. For example, to find the marginal from the joint, it's pretty straightforward. The probability that x takes a particular value is the sum of the probabilities of all of the different ways that this particular value may occur.

What are the different ways? Well, it may occur together with a certain y, or together with some other y, or together with some other y. So you look at all the possible y's that can go together with this x, and add the probabilities of all of those pairs for which we get this particular value of x. And then there's a relation between that connects these two probabilities with the conditional probability. And it's this relation. It's nothing new. It's just new notation for writing what we already know, that the probability of two things happening is the probability that the first thing happens, and then given that the first thing happens, the probability that the second one happened.

So how do we go from one to the other? Think of A as being the event that X takes the value, little x, and B being the event that Y takes the value, little y. So the joint probability is the probability that these two things happen simultaneously. It's the probability that X takes this value times the conditional probability that Y takes this value, given that X took that first value. So it's the familiar multiplication rule, but just transcribed in our new notation. So nothing new so far.

OK, why did we go through this exercise and this notation? It's because in the experiments where we're interested in the real world, typically there's going to be lots of uncertain quantities. There's going to be multiple random variables. And we want to be able to talk about them simultaneously. Okay. Why two and not more than two? How about three random variables? Well, if you understand what's going on in this slide, you should be able to kind of automatically generalize this to the case of multiple random variables.

So for example, if we have three random variables, X, Y, and Z, and you see an expression like this, it should be clear what it means. It's the probability that X takes this value and simultaneously Y takes that value and simultaneously Z takes that value. I guess that's an uppercase Z here, that's a lowercase z. And if I ask you to find the marginal of X, if I tell you the joint PMF of the three random variables and I ask you for this value, how would you find it? Well, you will try to generalize this relation here. The probability that x occurs is the sum of the probabilities of all events that make X to take that particular value.

So what are all the events? Well, this particular x can happen together with some y and some z. We don't care which y and z. Any y and z will do. So when we consider all possibilities, we need to add here over all possible values of y's and z's. So consider all triples, x, y, z. Fix x and consider all the possibilities for the remaining variables, y and z, add these up, and that gives you the marginal PMF of X. And then there's other things that you can do. This is the multiplication rule for two events.

We saw back in chapter one that there's a multiplication rule when you talk about more than two events. And you can write a chain of conditional probabilities. We can certainly do the same in our new notation. So let's look at this rule up here. Multiplication rule for three random variables, what does it say? The probability of three things happening simultaneously, X, Y, Z taking specific values, little x, little y, little z, that probability is the probability that the first thing happens, that X takes that value.

Given that X takes that value, we multiply it with the probability that Y takes also a certain value. And now, given that X and Y have taken those particular values, we multiply with a conditional probability that the third thing happens, given that the first two things happen. So this is just the multiplication rule for three events, which would be probability of A intersection B intersection C equals-- you know the rest of the formula. You just rewrite this formula in PMF notation. Probability of A intersection B intersection C is the probability of A, which corresponds to this term, times the probability of B given A, times the probability of C given A and B.

So what else is there that's left from chapter one that we can or should generalize to random variables? Well, there's the notion of independence. So let's define what independence means. Instead of talking about just two random variables, let's go directly to the case of multiple random variables. When we talked about events, things were a little complicated. We had a simple definition for independence of two events. Two events are independent if the probability of both is equal to the product of the probabilities. But for three events, it was kind of messy. We needed to write down lots of conditions.

For random variables, things in some sense are a little simpler. We only need to write down one formula and take this as the definition of independence. Three random variables are independent if and only if, by definition, their joint probability mass function factors out into individual probability mass functions. So the probability that all three things happen is the product of the individual probabilities that each one of these three things is happening. So independence means mathematically that you can just multiply probabilities to get to the probability of several things happening simultaneously.

So with three events, we have to write a huge number of equations, of equalities that have to hold. How can it be that with random variables we can only manage with one equality? Well, the catch is that this is not really just one equality. We require this to be true for every little x, y, and z. So in some sense, this is a bunch of conditions that are being put on the joint PMF, a bunch of conditions that we need to check. So this is the mathematical definition. What is the intuitive content of this definition? The intuitive content is the same as for events. Random variables are independent if knowing something about the realized values of some of these random variables does not change our beliefs about the likelihood of various values for the remaining random variables.

So independence would translate, for example, to a condition such as the conditional PMF of X , given y, should be equal to the marginal PMF of X. What is this saying? That you have some original beliefs about how likely it is for X to take this value. Now, someone comes and tells you that Y took on a certain value. This causes you, in principle, to revise your beliefs. And your new beliefs will be captured by the conditional PMF, or the conditional probabilities. Independence means that your revised beliefs actually will be the same as your original beliefs. Telling you information about the value of Y doesn't change what you expect for the random variable X.

Why didn't we use this definition for independence? Well, because this definition only makes sense when this conditional is well-defined. And this conditional is only well-defined if the events that Y takes on that particular value has positive probability. We cannot condition on events that have zero probability, so conditional probabilities are only defined for y's that are likely to occur, that have a positive probability.

Now, similarly, with multiple random variables, if they're independent, you would have relations such as the conditional of X, given y and z, should be the same as the marginal of X. What is this saying? Again, that if I tell you the values, the realized values of random variables Y and Z, this is not going to change your beliefs about how likely x is to occur. Whatever you believed in the beginning, you're going to believe the same thing afterwards. So it's important to keep that intuition in mind, because sometimes this way you can tell whether random variables are independent without having to do calculations and to check this formula.

OK, so let's check our concepts with a simple example. Let's look at two random variables that are discrete, take values between one and for each. And this is a table that gives us the joint PMF. So it tells us the probability that X equals to 2 and Y equals to 1 happening simultaneously. It's an event that has probability 1/20. Are these two random variables independent? You can try to check a condition like this. But can we tell directly from the table?

If I tell you a value of Y, could that give you useful information about X? Certainly. If I tell you that Y is equal to 1, this tells you that X must be equal to 2. But if I tell you that Y was equal to 3, this tells you that, still, X could be anything. So telling you the value of Y kind of changes what you expect or what you consider possible for the values of the other random variable. So by just inspecting here, we can tell that the random variables are not independent.

Okay. What's the other concept we introduced in chapter one? We introduced the concept of conditional independence. And conditional independence is like ordinary independence but applied to a conditional universe where we're given some information. So suppose someone tells you that the outcome of the experiment is such that X is less than or equal to 2 and Y is larger than or equal to 3. So we are given the information that we now live inside this universe.

So what happens inside this universe? Inside this universe, our random variables are going to have a new joint PMF which is conditioned on the event that we were told that it has occurred. So let A correspond to this sort of event here. And now we're dealing with conditional probabilities. What are those conditional probabilities? We can put them in a table. So it's a two by two table, since we only have two possible values. What are they going to be?

Well, these probabilities show up in the ratios 1, 2, 2, and 4. Those ratios have to stay the same. The probabilities need to add up to one. So what should the denominators be since these numbers add up to nine? These are the conditional probabilities. So this is the conditional PMF in this example. Now, in this conditional universe, is x independent from y? If I tell you that y takes this value, so we live in this universe, what do you know about x? What you know about x is at this value is twice as likely as that value. If I condition on y taking this value, so we're living here, what do you know about x? What you know about x is that this value is twice as likely as that value.

So it's the same. Whether we live here or we live there, this x is twice as likely as that x. So the conditional PMF in this new universe, the conditional PMF of X given y, in the new universe is the same as the marginal PMF of X, but of course in the new universe. So no matter what y is, the conditional PMF of X is the same. And that conditional PMF is 1/3 and 2/3. This is the conditional PMF of X in the new universe no matter what y occurs.

So Y does not give us any information about X, doesn't cause us to change our beliefs inside this little universe. And therefore the two random variables are independent. Now, the other way that you can verify that we have independence is to find the marginal PMFs of the two random variables. The marginal PMF of X, you find it by adding those two terms. You get 1/3. Adding those two terms, you get 2/3. Marginal PMF of Y, you find it, you add these two terms, and you get 1/3. And the marginal PMF of Y here is going to be 2/3.

And then you ask the question, is the joint the product of the marginals? And indeed it is. This times this gives you 1/9. This times this gives you 2/9. So the values in the table with the joint PMFs is the product of the marginal PMFs of X and Y in this universe, so the two random variables are independent inside this universe. So we say that they're conditionally independent. All right.

Now let's move to the new topic, to the new concept that we introduce in this chapter, which is the concept of expectations. So what are the things to know here? One is the general idea. The way to think about expectations is that it's something like the average value for random variable if you do an experiment over and over, and if you interpret probabilities as frequencies. So you get x's over and over with a certain frequency -- P(x) -- a particular value, little x, gets realized. And each time that this happens, you get x dollars. How many dollars do you get on the average? Well, this formula gives you that particular average.

So first thing we do is to write down a definition for this sort of concept. But then the other things you need to know is how to calculate expectations using shortcuts sometimes, and what properties they have. The most important shortcut there is is that, if you want to calculate the expected value, the average value for a random variable, you do not need to find the PMF of that random variable. But you can work directly with the x's and the y's. So you do the experiment over and over. The outcome of the experiment is a pair (x,y). And each time that a certain (x,y) happens, you get so many dollars.

So this fraction of the time, a certain (x,y) happens. And that fraction of the time, you get so many dollars, so this is the average number of dollars that you get. So what you end up, since it is the average, then that means that it corresponds to the expected value. Now, this is something that, of course, needs a little bit of mathematical proof. But this is just a different way of accounting. And it turns out we give you the right answer. And it's a very useful shortcut.

Now, when we're talking about functions of random variables, in general, we cannot speak just about averages. That is, the expected value of a function of a random variable is not the same as the function of the expected values. A function of averages is not the same as the average of a function. So in general, this is not true. But what it's important to know is to know the exceptions to this rule. And the important exceptions are mainly two. One is the case of linear functions of a random variable. We discussed this last time. So the expected value of temperature in Celsius is, you first find the expected value of temperature in Fahrenheit, and then you do the conversion to Celsius. So whether you first average and then do the conversion to the new units or not, it shouldn't matter when you get the result.

The other property that turns out to be true when you talk about multiple random variables is that expectation still behaves linearly. So let X, Y, and Z be the score of a random student at each one of the three sections of the SAT. So the overall SAT score is X plus Y plus Z. This is the average score, the average total SAT score. Another way to calculate that average is to look at the first section of the SAT and see what was the average. Look at the second section, look at what was the average, and so the third, and add the averages. So you can do the averages for each section separately, add the averages, or you can find total scores for each student and average them.

So I guess you probably believe that this is correct if you talk just about averaging scores. Since expectations are just the variation of averages, it turns out that this is also true in general. And the derivation of this is very simple, based on the expected value rule. And you can look at it in the notes. So this is one exception, which is linearity. The second important exception is the case of independent random variables, that the product of two random variables has an expectation which is the product of the expectations. In general, this is not true. But for the case where we have independence, the expectation works out as follows. Using the expected value rule, this is how you calculate the expected value of a function of a random variable. So think of this as being your g(X, Y) and this being your g(little x, y).

So this is something that's generally true. Now, if we have independence, then the PMFs factor out, and then you can separate this sum by bringing together the x terms, bring them outside the y summation. And you find that this is the same as expected value of X times the expected value of Y. So independence is used in this step here. OK, now what if X and Y are independent, but instead of taking the expectation of X times Y, we take the expectation of the product of two functions of X and Y? I claim that the expected value of the product is still going to be the product of the expected values.

How do we show that? We could show it by just redoing this derivation here. Instead of X and Y, we would have g(X) and h(Y), so the algebra goes through. But there's a better way to think about it which is more conceptual. And here's the idea. If X and Y are independent, what does it mean? X does not convey any information about Y. If X conveys no information about Y, does X convey information about h(Y)? No. If X tells me nothing about Y, nothing new, it shouldn't tell me anything about h(Y).

Now, if X tells me nothing about h of h(Y), could g(X) tell me something about h(Y)? No. So the idea is that, if X is unrelated to Y, doesn't have any useful information, then g(X) could not have any useful information for h(Y). So if X and Y are independent, then g(X) and h(Y) are also independent. So this is something that one can try to prove mathematically, but it's more important to understand conceptually why this is so. It's in terms of conveying information.

So if X tells me nothing about Y, X cannot tell me anything about Y cubed, or X cannot tell me anything by Y squared, and so on. That's the idea. And once we are convinced that g(X) and h(Y) are independent, then we can apply our previous rule, that for independent random variables, expectations multiply the right way. Apply the previous rule, but apply it now to these two independent random variables. And we get the conclusion that we wanted.

Now, besides expectations, we also introduced the concept of the variance. And if you remember the definition of the variance, let me write down the formula for the variance of aX. It's the expected value of the random variable that we're looking at minus the expected value of the random variable that we're looking at. So this is the difference of the random variable from its mean. And we take that difference and square it, so it's the squared distance from the mean, and then take expectations of the whole thing.

So when you look at that expression, you realize that a can be pulled out of those expressions. And because there is a squared, when you pull out the a, it's going to come out as an a-squared. So that gives us the rule for finding the variance of a scale or product of a random variable. The variance captures the idea of how wide, how spread out a certain distribution is. Bigger variance means it's more spread out.

Now, if you take a random variable and the constants to it, what does it do to its distribution? It just shifts it, but it doesn't change its width. So intuitively it means that the variance should not change. You can check that mathematically, but it should also make sense intuitively. So the variance, when you add the constant, does not change. Now, can you add variances is the way we added expectations? Does variance behave linearly? It turns out that not always. Here, we need a condition. It's only in special cases-- for example, when the two random variables are independent-- that you can add variances. The variance of the sum is the sum of the variances if X and Y are independent.

The derivation of this is, again, very short and simple. We'll skip it, but it's an important fact to remember. Now, to appreciate why this equality is not true always, we can think of some extreme examples. Suppose that X is the same as Y. What's going to be the variance of X plus Y? Well, X plus Y, in this case, is the same as 2X, so we're going to get 4 times the variance of X, which is different than the variance of X plus the variance of X.

So that expression would give us twice the variance of X. But actually now it's 4 times the variance of X. The other extreme would be if X is equal to -Y. Then the variance is the variance of the random variable, which is always equal to 0. Now, a random variable which is always equal to 0 has no uncertainty. It is always equal to its mean value, so the variance, in this case, turns out to be 0.

So in both of these cases, of course we have random variables that are extremely dependent. Why are they dependent? Because if I tell you something about Y, it tells you an awful lot about the value of X. There's a lot of information about X if I tell you Y, in this case or in that case. And finally, a short drill. If I tell you that the random variables are independent and you want to calculate the variance of a linear combination of this kind, then how do you argue? You argue that, since X and Y are independent, this means that X and 3Y are also independent. X has no information about Y, so X has no information about -Y. X has no information about -Y, so X should not have any information about -3Y. So X and -3Y are independent.

So the variance of Z should be the variance of X plus the variance of -3Y, which is the variance of X plus 9 times the variance of Y. The important thing to note here is that no matter what happens, you end up getting a plus here, not a minus. So that's the sort of important thing to remember in this type of calculation. So this has been all concepts, reviews, new concepts and all that. It's the usual fire hose. Now let's use them to do something useful finally.

So let's revisit our old example, the binomial distribution, which counts the number of successes in independent trials of a coin. It's a biased coin that has a probability of heads, or probability of success, equal to p at each trial. Finally, we can go through the exercise of calculating the expected value of this random variable. And there's the way of calculating that expectation that would be the favorite of those people who enjoy algebra, which is to write down the definition of the expected value. We add over all possible values of the random variable, over all the possible k's, and weigh them according to the probabilities that this particular k occurs. The probability that X takes on a particular value k is, of course, the binomial PMF, which is this familiar formula.

Clearly, that would be a messy and challenging calculation. Can we find a shortcut? There's a very clever trick. There's lots of problems in probability that you can approach really nicely by breaking up the random variable of interest into a sum of simpler and more manageable random variables. And if you can make it to be a sum of random variables that are just 0's or 1's, so much the better. Life is easier. Random variables that take values 0 or 1, we call them indicator variables. They indicate whether an event has occurred or not.

In this case, we look at each coin flip one at a time. For the i-th flip, if it resulted in heads or a success, we record it 1. If not, we record it 0. And then we look at the random variable. If we take the sum of the Xi's, what is it going to be? We add one each time that we get a success, so the sum is going to be the total number of successes. So we break up the random variable of interest as a sum of really nice and simple random variables.

And now we can use the linearity of expectations. We're going to find the expectation of X by finding the expectation of the Xi's and then adding the expectations. What's the expected value of Xi? Well, Xi takes the value 1 with probability p, and takes the value 0 with probability 1-p. So the expected value of Xi is just p. So the expected value of X is going to be just n times p. Because X is the sum of n terms, each one of which has expectation p, the expected value of the sum is the sum of the expected values. So I guess that's a pretty good shortcut for doing this horrendous calculation up there.

So in case you didn't realize it, that's what we just established without doing any algebra. Good. How about the variance of X, of Xi? Two ways to calculate it. One is by using directly the formula for the variance, which would be -- let's see what it would be. With probability p, you get a 1. And in this case, you are so far from the mean. That's your squared distance from the mean. With probability 1-p, you get a 0, which is so far away from the mean. And then you can simplify that formula and get an answer.

How about a slightly easier way of doing it. Instead of doing the algebra here, let me indicate the slightly easier way. We have a formula for the variance that tells us that we can find the variance by proceeding this way. That's a formula that's generally true for variances. Why is this easier? What's the expected value of Xi squared? Backtrack. What is Xi squared, after all? It's the same thing as Xi. Since Xi takes value 0 and 1, Xi squared also takes the same values, 0 and 1. So the expected value of Xi squared is the same as the expected value of Xi, which is equal to p. And the expected value of Xi squared is p squared, so we get the final answer, p times (1-p).

If you were to work through and do the cancellations in this messy expression here, after one line you would also get to the same formula. But this sort of illustrates that working with this formula for the variance, sometimes things work out a little faster. Finally, are we in business? Can we calculate the variance of the random variable X as well? Well, we have the rule that for independent random variables, the variance of the sum is the sum of the variances. So to find the variance of X, we just need to add the variances of the Xi's. We have n Xi's, and each one of them has variance p_n times (1-p). And we are done.

So this way, we have calculated both the mean and the variance of the binomial random variable. It's interesting to look at this particular formula and see what it tells us. If you are to plot the variance of X as a function of p, it has this shape. And the maximum is here at 1/2. p times (1-p) is 0 when p is equal to 0. And when p equals to 1, it's a quadratic, so it must have this particular shape. So what does it tell us? If you think about variance as a measure of uncertainty, it tells you that coin flips are most uncertain when your coin is fair. When p is equal to 1/2, that's when you have the most randomness.

And this is kind of intuitive. if on the other hand I tell you that the coin is extremely biased, p very close to 1, which means it almost always gives you heads, then that would be a case of low variance. There's low variability in the results. There's little uncertainty about what's going to happen. It's going to be mostly heads with some occasional tails. So p equals 1/2. Fair coin, that's the coin which is the most uncertain of all coins, in some sense. And it corresponds to the biggest variance. It corresponds to an X that has the widest distribution.

Now that we're on a roll and we can calculate such hugely complicated sums in simple ways, let us try to push our luck and do a problem with this flavor, but a little harder than that. So you go to one of those old-fashioned cocktail parties. All males at least will have those standard big hats which look identical. They check them in when they walk in. And when they walk out, since they look pretty identical, they just pick a random hat and go home. So n people, they pick their hats completely at random, quote, unquote, and then leave. And the question is, to say something about the number of people who end up, by accident or by luck, to get back their own hat, the exact same hat that they checked in.

OK, first what do we mean completely at random? Completely at random, we basically mean that any permutation of the hats is equally likely. Any way of distributing those n hats to the n people, any particular way is as likely as any other way. So there's complete symmetry between hats and people. So what we want to do is to calculate the expected value and the variance of this random variable X. Let's start with the expected value.

Let's reuse the trick from the binomial case. So total number of hats picked, we're going to think of total number of hats picked as a sum of (0, 1) random variables. X1 tells us whether person 1 got their own hat back. If they did, we record a 1. X2, the same thing. By adding all X's is how many 1's did we get, which counts how many people selected their own hats. So we broke down the random variable of interest, the number of people who get their own hats back, as a sum of random variables. And these random variables, again, are easy to handle, because they're binary. The only take two values.

What's the probability that Xi is equal to 1, the i-th person has a probability that they get their own hat? There's n hats by symmetry. The chance is that they end up getting their own hat, as opposed to any one of the other n - 1 hats, is going to be 1/n. So what's the expected value of Xi? It's one times 1/n. With probability 1/n, you get your own hat, or you get a value of 0 with probability 1-1/n, which is 1/n.

All right, so we got the expected value of the Xi's. And remember, we want to do is to calculate the expected value of X by using this decomposition? Are the random variables Xi independent of each other? You can try to answer that question by writing down a joint PMF for the X's, but I'm sure that you will not succeed. But can you think intuitively? If I tell you information about some of the Xi's, does it give you information about the remaining ones? Yeah. If I tell you that out of 10 people, 9 of them got their own hat back, does that tell you something about the 10th person? Yes. If 9 got their own hat, then the 10th must also have gotten their own hat back.

So the first 9 random variables tell you something about the 10th one. And conveying information of this sort, that's the case of dependence. All right, so the random variables are not independent. Are we stuck? Can we still calculate the expected value of X? Yes, we can. And the reason we can is that expectations are linear. Expectation of a sum of random variables is the sum of the expectations. And that's always true. There's no independence assumption that's being used to apply that rule. So we have that the expected value of X is the sum of the expected value of the Xi's. And this is a property that's always true. You don't need independence. You don't care. So we're adding n terms, each one of which has expected value 1/n. And the final answer is 1.

So out of the 100 people who selected hats at random, on the average, you expect only one of them to end up getting their own hat back. Very good. So since we are succeeding so far, let's try to see if we can succeed in calculating the variance as well. And of course, we will. But it's going to be a little more complicated. The reason it's going to be a little more complicated is because the Xi's are not independent, so the variance of the sum is not the same as the sum of the variances.

So it's not enough to find the variances of the Xi's. We'll have to do more work. And here's what's involved. Let's start with the general formula for the variance, which, as I mentioned before, it's usually the simpler way to go about calculating variances. So we need to calculate the expected value for X-squared, and subtract from it the expectation squared. Well, we already found the expected value of X. It's equal to 1. So 1-squared gives us just 1. So we're left with the task of calculating the expected value of X-squared, the random variable X-squared. Let's try to follow the same idea. Write this messy random variable, X-squared, as a sum of hopefully simpler random variables.

So X is the sum of the Xi's, so you square both sides of this. And then you expand the right-hand side. When you expand the right-hand side, you get the squares of the terms that appear here. And then you get all the cross-terms. For every pair of (i,j) that are different, i different than j, you're going to have a cross-term in the sum. So now, in order to calculate the expected value of X-squared, what does our task reduce to? It reduces to calculating the expected value of this term and calculating the expected value of that term. So let's do them one at a time.

Expected value of Xi squared, what is it going to be? Same trick as before. Xi takes value 0 or 1, so Xi squared takes just the same values, 0 or 1. So that's the easy one. That's the same as expected value of Xi, which we already know to be 1/n. So this gives us a first contribution down here. The expected value of this term is going to be what? We have n terms in the summation. And each one of these terms has an expectation of 1/n. So we did a piece of the puzzle. So now let's deal with the second piece of the puzzle.

Let's find the expected value of Xi times Xj. Now by symmetry, the expected value of Xi times Xj is going to be the same no matter what i and j you see. So let's just think about X1 and X2 and try to find the expected value of X1 and X2. X1 times X2 is a random variable. What values does it take? Only 0 or 1? Since X1 and X2 are 0 or 1, their product can only take the values of 0 or 1. So to find the probability distribution of this random variable, it's just sufficient to find the probability that it takes the value of 1. Now, what does X1 times X2 equal to 1 mean? It means that X1 was 1 and X2 was 1. The only way that you can get a product of 1 is if both of them turned out to be 1's.

So that's the same as saying, persons 1 and 2 both picked their own hats. The probability that person 1 and person 2 both pick their own hats is the probability of two things happening, which is the product of the first thing happening times the conditional probability of the second, given that the first happened. And in words, this is the probability that the first person picked their own hat times the probability that the second person picks their own hat, given that the first person already picked their own. So what's the probability that the first person picks their own hat? We know that it's 1/n.

Now, how about the second person? If I tell you that one person has their own hat, and that person takes their hat and goes away, from the point of view of the second person, there's n - 1 people left looking at n - 1 hats. And they're getting just hats at random. What's the chance that I will get my own? It's 1/n - 1. So think of them as person 1 goes, picks a hat at random, it happens to be their own, and it leaves. You're left with n - 1 people, and there are n - 1 hats out there.

Person 2 goes and picks a hat at random, with probability 1/n - 1, is going to pick his own hat. So the expected value now of this random variable is, again, that same number, because this is a 0, 1 random variable. So this is the same as expected value of Xi times Xj when i different than j. So here, all that's left to do is to add the expectations of these terms. Each one of these terms has an expected value that's 1/n times (1/n - 1).

And how many terms do we have? How many of these are we adding up? It's n-squared - n. When you expand the quadratic, there's a total of n-squared terms. Some are self-terms, n of them. And the remaining number of terms is n-squared - n. So here we got n-squared - n terms. And so we need to multiply here with n-squared - n. And after you realize that this number here is 1, and you realize that this is the same as the denominator, you get the answer that the expected value of X squared equals 2. And then, finally going up to the top formula, we get the expected value of X squared, which is 2 - 1, and the variance is just equal to 1.

So the variance of this random variable, number of people who get their own hats back, is also equal to 1, equal to the mean. Looks like magic. Why is this the case? Well, there's a deeper explanation why these two numbers should come out to be the same. But this is something that would probably have to wait a couple of chapters before we could actually explain it. And so I'll stop here.