Description: In this lecture, the professor discussed multiple random variables: conditioning and independence.
Instructor: John Tsitsiklis
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
JOHN TSITSIKLIS: OK, let's start. So we've had the quiz. And I guess there's both good and bad news in it. Yesterday, as you know, came the bad news. The average was a little lower than what we would have wanted. On the other hand, the good news is that the distribution was nicely spread. And the main purpose of this quiz is basically for you to calibrate and see roughly where you are standing.
The other piece of the good news is that, as you know, this quiz doesn't count for very much in your final grade. So it's really a matter of calibration and to get your mind set appropriately to prepare for the second quiz, which counts a lot more. And it's more substantial. And we'll make sure that the second quiz will have a higher average.
All right. So let's go to our material. We're talking now these days about continuous random variables. And I'll remind you what we discussed last time. I'll remind you of the concept of the probability density function of a single random variable. And then we're going to rush through all the concepts that we covered for the case of discrete random variables and discuss their analogs for the continuous case. And talk about notions such as conditioning, independence, and so on.
So the big picture is here. We have all those concepts that we developed for the case of discrete random variables. And now we will just talk about their analogs in the continuous case. We already discussed this analog last week, the density of a single random variable.
Then there are certain concepts that show up both in the discrete and the continuous case. So we have the cumulative distribution function, which is a description of the probability distribution of a random variable and which applies whether you have a discrete or continuous random variable. Then there's the notion of the expected value.
And in the two cases, the expected value is calculated in a slightly different way, but not very different. We have sums in one case, integrals in the other. And this is the general pattern that we're going to have. Formulas for the discrete case translate to corresponding formulas or expressions in the continuous case. We generically replace sums by integrals, and we replace mass functions with density functions.
Then the new pieces for today are going to be mostly the notion of a joint density function, which is how we describe the probability distribution of two random variables that are somehow related, in general, and then the notion of a conditional density function that tells us the distribution of one random variable X when you're told the value of another random variable Y. There's another concept, which is the conditional PDF given that a certain event has happened. This is a concept that's in some ways simpler. You've already seen a little bit of that in last week's recitation and tutorial.
The idea is that we have a single random variable. It's described by a density. Then you're told that a certain event has occurred. Your model changes the universe that you are dealing with. In the new universe, you are dealing with a new density function, the one that applies given the knowledge that we have that a certain event has occurred.
All right. So what exactly did we say about continuous random variables? The first thing is the definition, that a random variable is said to be continuous if we are given a certain object that we call the probability density function and we can calculate interval probabilities given this density function. So the definition is that the random variable is continuous if you can calculate probabilities associated with that random variable given that formula. So this formula tells you that the probability that your random variable falls inside this interval is the area under the density curve.
OK. There's a few properties that a density function must satisfy. Since we're talking about probabilities, and probabilities are non-negative, we have that the density function is always a non-negative function. The total probability over the entire real line must be equal to 1. So the integral when you integrate over the entire real line has to be equal to 1. That's the second property.
Another property that you get is that if you let a equal to b, this integral becomes 0. And that tells you that the probability of a single point in the continuous case is always equal to 0. So these are formal properties.
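The slide with these formulas is not reproduced in the transcript. In standard notation (writing f_X for the density of X, which is a notational assumption here), the definition and the properties just listed can be written as:

```latex
\begin{align*}
P(a \le X \le b) &= \int_a^b f_X(x)\,dx, \\
f_X(x) &\ge 0 \quad \text{for all } x, \\
\int_{-\infty}^{\infty} f_X(x)\,dx &= 1, \\
P(X = a) &= \int_a^a f_X(x)\,dx = 0.
\end{align*}
```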
When you want to think intuitively, the best way to think about what the density function is, is to think in terms of little intervals, the probability that my random variable falls inside the little interval. Well, inside that little interval, the density function here is roughly constant. So that integral becomes the value of the density times the length of the interval over which you are integrating, which is delta.
And so the density function basically gives us probabilities of little events, of small events. And the density is to be interpreted as probability per unit length at a certain place in the diagram. So in that place in the diagram, the probability per unit length around this neighborhood would be the height of the density function at that point.
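Written out, with s as a dummy integration variable (the symbol is an assumption; the slide is not in the transcript), the small-interval interpretation is:

```latex
\[
P(x \le X \le x + \delta) = \int_x^{x+\delta} f_X(s)\,ds \approx f_X(x)\,\delta
\qquad \text{for small } \delta > 0 .
\]
```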
What else? We have a formula for calculating expected values of functions of random variables. In the discrete case, we had the formula where here we had the sum, and instead of the density, we had the PMF. The same formula is also valid in the continuous case. And it's not too hard to derive, but we will not do it.
But let's think of the intuition of what this formula says. You're trying to figure out on the average how much g(X) is going to be. And then you reason, and you say, well, X may turn out to take a particular value or a small interval of values. This is the probability that X falls inside the small interval. And when that happens, g(X) takes that value.
So this fraction of the time, you fall in the little neighborhood of x, and you get so much. Then you average over all the possible x's that can happen. And that gives you the average value of the function g(X).
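For reference, the two versions of the expected value rule being compared here are presumably the following, with p_X denoting the PMF and f_X the PDF (notation assumed):

```latex
\begin{align*}
\text{discrete:}\quad   & E[g(X)] = \sum_x g(x)\,p_X(x), \\
\text{continuous:}\quad & E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx .
\end{align*}
```

In the continuous formula, f_X(x) dx plays the role of the probability of the small interval around x, and g(x) is the value obtained when X falls there.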
OK. So this is the easy stuff. Now let's get to the new material. We want to talk about multiple random variables simultaneously. So we want to talk now about two random variables that are continuous, and in some sense that they are jointly continuous.
And let's see what this means. The definition is similar to the definition we had for a single random variable, where I take this formula here as the definition of continuous random variables. Two random variables are said to be jointly continuous if we can calculate probabilities by integrating a certain function that we call the joint density function over the set of interest.
So we have our two-dimensional plane. This is the x-y plane. There's a certain event S that we're interested in. We want to calculate the probability. How do we do that?
We are given this function f_(X,Y), the joint density. It's a function of the two arguments x and y. So think of that function as being some kind of surface that sits on top of the two-dimensional plane. The probability of falling inside the set S, we calculate it by looking at the volume under the surface, that volume that sits on top of S. So the surface underneath it has a certain total volume.
What should that total volume be? Well, we think of these volumes as probabilities. So the total probability should be equal to 1. The total volume under this surface should be equal to 1. So that's one property that we want our density function to have. So when you integrate over the entire space, this is the volume under your surface. That should be equal to 1. Of course, since we're talking about probabilities, the joint density should be a non-negative function.
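In symbols, the definition of joint continuity and the two properties just described can be sketched as follows (S is any subset of the plane of interest):

```latex
\begin{align*}
P\big((X,Y) \in S\big) &= \iint_S f_{X,Y}(x,y)\,dx\,dy, \\
f_{X,Y}(x,y) &\ge 0 \quad \text{for all } (x,y), \\
\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy &= 1 .
\end{align*}
```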
So think of the situation as having one pound of probability that's spread all over your space. And the height of this joint density function basically tells you how much probability tends to be accumulated in certain regions of space as opposed to other parts of the space. So wherever the density is big, that means that this is an area of the two-dimensional plane that's more likely to occur. Where the density is small, that means that those x-y's are less likely to occur.
You have already seen one example of continuous densities. That was the example we had in the very beginning of the class with a uniform distribution on the unit square. That was a special case of a density function that was constant. So all places in the unit square were roughly equally likely as any other places.
But in other models, some parts of the space may be more likely than others. And we describe those relative likelihoods using this density function. So if somebody gives us the density function, this determines for us probabilities of all the subsets of the two-dimensional plane.
Now for an intuitive interpretation, it's good to think about small events. So let's take a particular x here and then x plus delta. So this is a small interval. Take another small interval here that goes from y to y plus delta. And let's look at the event that x falls here and y falls right there.
What is this event? Well, this is the event that (X,Y) falls inside this little rectangle. Using this rule for calculating probabilities, what is the probability of that rectangle going to be? Well, it should be the integral of the density over this rectangle. Or it's the volume under the surface that sits on top of that rectangle.
Now, if the rectangle is very small, the joint density is not going to change very much in that neighborhood. So we can treat it as a constant. So the volume is going to be the height times the area of the base. The height at that point is whatever the function happens to be around that point. And the area of the base is delta squared.
So this is the intuitive way to understand what a joint density function really tells you. It specifies for you probabilities of little squares, of little rectangles. And it allows you to think of the joint density function as probability per unit area. So these are the units of the density: it's probability per unit area in the neighborhood of a certain point.
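The small-rectangle approximation described above, written out (height of the density times area of the base):

```latex
\[
P(x \le X \le x+\delta,\; y \le Y \le y+\delta)
= \int_y^{y+\delta}\!\int_x^{x+\delta} f_{X,Y}(s,t)\,ds\,dt
\approx f_{X,Y}(x,y)\,\delta^2 .
\]
```

Here s and t are dummy integration variables introduced only for this sketch.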
So what do we do with this density function once we have it in our hands? Well, we can use it to calculate expected values. Suppose that you have a function of two random variables described by a joint density. You can find, perhaps, the distribution of this random variable and then use the basic definition of the expectation. Or you can calculate expectations directly, using the distribution of the original random variables.
This is a formula that's again identical to the formula that we had for the discrete case. In the discrete case, we had a double sum here, and we had PMFs. So the intuition behind this formula is the same that one had for the discrete case. It's just that the mechanics are different.
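The formula referred to is on a slide that is not in the transcript. It is presumably the two-variable expected value rule, shown here next to its discrete counterpart (p_{X,Y} is the joint PMF, f_{X,Y} the joint PDF):

```latex
\begin{align*}
\text{discrete:}\quad   & E[g(X,Y)] = \sum_x \sum_y g(x,y)\,p_{X,Y}(x,y), \\
\text{continuous:}\quad & E[g(X,Y)] = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} g(x,y)\,f_{X,Y}(x,y)\,dx\,dy .
\end{align*}
```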
Then something that we did in the discrete case was to find a way to go from the joint density of the two random variables taken together to the density of just one of the random variables. So we had a formula for the discrete case. Let's see how things are going to work out in the continuous case.
So in the continuous case, we have here our two random variables. And we have a density for them. And let's say that we want to calculate the probability that x falls inside this interval. So we're looking at the probability that our random variable X falls in the interval from little x to x plus delta.
Now, by the properties that we already have for interpreting the density function of a single random variable, the probability of a little interval is approximately the density of that single random variable times delta. And now we want to find a formula for this marginal density in terms of the joint density.
OK. So this is the probability that x falls inside this interval. In terms of the two-dimensional plane, this is the probability that (x,y) falls inside this strip. So to find that probability, we need to calculate the probability that (x,y) falls in here, which is going to be the double integral over the interval over this strip, of the joint density.
And what are we integrating over? y goes from minus infinity to plus infinity. And the dummy variable x goes from little x to x plus delta. So to integrate over this strip, what we do is for any given y, we integrate in this dimension. This is the x integral. And then we integrate over the y dimension.
Now what is this inner integral? Because x only varies very little, this is approximately constant in that range. So the integral with respect to x just becomes delta times f(x,y). And then we've got our dy.
So this is what the inner integral will evaluate to. We are integrating over the little interval. So we're keeping y fixed. Integrating over here, we take the value of the density times how much we're integrating over. And we get this formula.
OK. Now, this expression must be equal to that expression. So if we cancel the deltas, we see that the marginal density must be equal to the integral of the joint density, where we have integrated out the value of y. So this formula should come as no surprise at this point.
It's exactly the same as the formula that we had for discrete random variables. But now we are replacing the sum with an integral. And instead of using the joint PMF, we are using the joint PDF.
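The strip argument above can be collected into one chain of approximations. This is a sketch of the board derivation, with s as the dummy variable that runs from x to x plus delta:

```latex
\begin{align*}
\delta\, f_X(x) \;\approx\; P(x \le X \le x+\delta)
  &= \int_{-\infty}^{\infty} \left( \int_x^{x+\delta} f_{X,Y}(s,y)\,ds \right) dy \\
  &\approx \int_{-\infty}^{\infty} \delta\, f_{X,Y}(x,y)\,dy .
\end{align*}
Cancelling $\delta$ on both sides gives
\[
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy ,
\qquad \text{the analog of } p_X(x) = \sum_y p_{X,Y}(x,y).
\]
```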
Then, continuing going down the list of things we did for discrete random variables, we can now introduce a definition of the notion of independence of two random variables. And by analogy with the discrete case, we define independence to be the following condition. Two random variables are independent if and only if their joint density function factors out as a product of their marginal densities. And this property needs to be true for all x and y. So this is the formal definition.
Operationally and intuitively, what does it mean? Well, intuitively it means the same thing as in the discrete case. Knowing anything about X shouldn't tell you anything about Y. That is, information about X is not going to change your beliefs about Y. We are going to come back to this statement in a second.
The other thing that it allows you to do-- I'm not going to derive this-- is it allows you to calculate probabilities by multiplying individual probabilities. So if you ask for the probability that x falls in a certain set A and y falls in a certain set B, then you can calculate that probability by multiplying individual probabilities. This takes just two lines of derivation, which I'm not going to do. But it comes back to the usual notion of independence of events. Basically, operationally independence means that you can multiply probabilities.
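The definition of independence, and one way the short derivation mentioned above could go (a sketch, not necessarily the version the lecturer had in mind):

```latex
\[
f_{X,Y}(x,y) = f_X(x)\,f_Y(y) \qquad \text{for all } x, y .
\]
For sets $A$ and $B$:
\begin{align*}
P(X \in A,\, Y \in B)
  &= \int_{x \in A}\int_{y \in B} f_X(x)\,f_Y(y)\,dy\,dx \\
  &= \left(\int_{x \in A} f_X(x)\,dx\right)\left(\int_{y \in B} f_Y(y)\,dy\right)
   = P(X \in A)\,P(Y \in B).
\end{align*}
```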
So now let's look at an example. There's a sort of pretty famous and classical one. It goes back a lot more than 100 years. And it's the famous Needle of Buffon. Buffon was a French naturalist who, for some reason, also decided to play with probability. And look at the following problem.
So you have the two-dimensional plane. And on the plane we draw a bunch of parallel lines. And those parallel lines are separated by a certain length: the lines are a distance d apart. And we throw a needle at random, completely at random. And we'll have to give a meaning to what "completely at random" means.
And when we throw a needle, there's two possibilities. Either the needle is going to fall in a way that does not intersect any of the lines, or it's going to fall in a way that it intersects one of the lines. We're taking the needle to be shorter than this distance, so the needle cannot intersect two lines simultaneously. It either intersects 0, or it intersects one of the lines. The question is to find the probability that the needle is going to intersect a line.
What's the probability of this? OK. We are going to approach this problem by using our standard four-step procedure. Set up your sample space, describe a probability law on that sample space, identify the event of interest, and then calculate. These four steps basically correspond to these three bullets and then the last equation down here.
So first thing is to set up a sample space. We need some variables to describe what happened in the experiment. So what happens in the experiment is that the needle lands somewhere. And where it lands, we can describe this by specifying the location of the center of the needle.
And what do we mean by the location of the center? Well, we can take as our variable to be the distance from the center of the needle to the nearest line. So it tells us the vertical distance of the center of the needle from the nearest line.
The other thing that matters is the orientation of the needle. So we need one more variable, which we take to be the angle that the needle is forming with the lines. We can put the angle here, or you can put in there. Yes, it's still the same angle.
So we have these two variables that described what happened in the experiment. And we can take our sample space to be the set of all possible x's and theta's. What are the possible x's? The lines are d apart, so the nearest line is going to be anywhere between 0 and d/2 away. So that tells us what the possible x's will be.
As for theta, it really depends how you define your angle. We are going to define our theta to be the acute angle that's formed between the needle and a line, if you were to extend it. So theta is going to be something between 0 and pi/2. So I guess these red pieces really correspond to the part of setting up the sample space. OK. So that's part one.
Second part is we need a model. OK. Let's take our model to be that we basically know nothing about how the needle falls. It can fall in any possible way, and all possible ways are equally likely.
Now, if you have those parallel lines, and you close your eyes completely and throw a needle completely at random, any x should be equally likely. So we describe that situation by saying that X should have a uniform distribution. That is, it should have a constant density over the range of interest.
Similarly, if you kind of spin your needle completely at random, any angle should be as likely as any other angle. And we decide to model this situation by saying that theta also has a uniform distribution over the range of interest. And finally, where we put it should have nothing to do with how much we rotate it. And we capture this mathematically by saying that X is going to be independent of theta.
Now, this is going to be our model. I'm not deriving the model from anything. I'm only saying that this sounds like a model that does not assume any knowledge or preference for certain values of x or theta rather than others. In the absence of any other particular information you might have in your hands, that's the most reasonable model to come up with. So you model the problem that way.
So what's the formula for the joint density? It's going to be the product of the densities of X and Theta. Why is it the product? This is because we assumed independence.
And the density of X, since it's uniform, and since it needs to integrate to 1, that density needs to be 2/d. That's the density of X. And the density of Theta needs to be 2/pi. That's the value for the density of Theta so that the overall probability over this interval ends up being 1.
So now we do have our joint density in our hands. The next thing to do is to identify the event of interest. And this is best done in a picture. And there's two possible situations that one could have. Either the needle falls this way, or it falls this way.
So how can we tell if one or the other is going to happen? It has to do with whether this interval here is smaller than that or bigger than that. So we are comparing the height of this interval to that interval. This interval here is capital X.
This interval here, what is it? This is half of the length of the needle, which is l/2. To find this height, we take l/2 and multiply it with the sine of the angle that we have. So the length of this interval up here is l/2 times sine theta.
If this is smaller than x, the needle does not intersect the line. If this is bigger than x, then the needle intersects the line. So the event of interest, that the needle intersects the line, is described this way in terms of x and theta.
And now that we have the event of interest described mathematically, all that we need to do is to find the probability of this event, we integrate the joint density over the part of (x, theta) space in which this inequality is true. So it's a double integral over the set of all x's and theta's where this is true. The way to do this integral is we fix theta, and we integrate for x's that go from 0 up to that number. And theta can be anything between 0 and pi/2. So the integral over this set is basically this double integral here.
We already have a formula for the joint density. It's 4 over pi d, so we put it here. And now, fortunately, this is a pretty easy integral to evaluate. The integral with respect to x -- there's nothing in here. So the integral is just the length of the interval over which we're integrating. It's l/2 sine theta.
And then we need to integrate this with respect to theta. We know that the integral of a sine is a negative cosine. You plug in the values for the negative cosine at the two end points. I'm sure you can do this integral. And we finally obtain the answer, which is amazingly simple for such a pretty complicated-looking problem. It's 2l over pi d.
So some people a long, long time ago, after they looked at this answer, they said that maybe that gives us an interesting way where one could estimate the value of pi, for example, experimentally. How do you do that? Fix l and d, the dimensions of the problem. Throw a million needles on your piece of paper. See how often your needles do intersect the line.
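The full calculation, assembled from the pieces described above (the joint density 4/(pi d) on the rectangle 0 ≤ x ≤ d/2, 0 ≤ theta ≤ pi/2, and the event X ≤ (l/2) sin Theta):

```latex
\begin{align*}
P\!\left(X \le \tfrac{l}{2}\sin\Theta\right)
  &= \int_0^{\pi/2}\!\int_0^{(l/2)\sin\theta} \frac{4}{\pi d}\,dx\,d\theta \\
  &= \frac{4}{\pi d}\int_0^{\pi/2} \frac{l}{2}\,\sin\theta\,d\theta \\
  &= \frac{2l}{\pi d}\,\Big[-\cos\theta\Big]_0^{\pi/2}
   = \frac{2l}{\pi d}.
\end{align*}
```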
That gives you a number for this quantity. You know l and d, so you can use that to infer pi. And there's an apocryphal story about a wounded soldier in a hospital after the American Civil War who actually had heard about this and was spending his time in the hospital throwing needles on pieces of paper. I don't know if it's true or not. But let's do something similar here.
So let's look at this diagram. We fix the dimensions. This is supposed to be our little d. That's supposed to be our little l. We have the formula from the previous slide that p is 2l over pi d. In this instance, we choose d to be twice l. So this number is 1/pi. So the probability that the needle hits the line is 1/pi.
So I need needles that are 3.1 centimeters long. I couldn't find such needles. But I could find paper clips that are 3.1 centimeters long. So let's start throwing paper clips at random and see how many of them will end up intersecting the lines. Good.
OK. So out of eight paper clips, we have exactly four that intersected the line. So our estimate for the probability of intersecting the line is 1/2, which gives us an estimate for the value of pi, which is two. Well, I mean, within an engineering approximation, we're in the right ballpark, right?
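Eight paper clips is a very small sample. The same experiment can be sketched in software under the lecture's model: X uniform on [0, d/2], Theta uniform on [0, pi/2], independent of each other. The function name, the seed, and the number of throws below are illustrative choices, not part of the lecture. Note also that the sketch uses math.pi to draw the angle, so it only illustrates the idea rather than being an honest way of discovering pi.

```python
import math
import random


def estimate_crossing_probability(l, d, n_throws, seed=0):
    """Estimate P(needle crosses a line) by simulating the lecture's model."""
    if not 0 < l < d:
        raise ValueError("the lecture's setup assumes 0 < l < d")
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_throws):
        x = rng.uniform(0.0, d / 2)            # distance from needle center to nearest line
        theta = rng.uniform(0.0, math.pi / 2)  # acute angle between needle and the lines
        if x <= (l / 2) * math.sin(theta):     # the event of interest from the lecture
            crossings += 1
    return crossings / n_throws


# Same proportions as in the demonstration: d is twice l, so p should be near 1/pi.
l, d = 1.0, 2.0
p_hat = estimate_crossing_probability(l, d, n_throws=200_000)
pi_hat = 2 * l / (p_hat * d)  # invert p = 2l / (pi d)
print(p_hat, pi_hat)
```

With many more throws than eight, the estimate of pi should land much closer to the true value than the paper clip estimate of 2.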
So this might look like a silly way of trying to estimate pi. And it probably is. On the other hand, this kind of methodology is being used especially by physicists and also by statisticians. It's used a lot. When is it used?
If you have an integral to calculate, such as this integral, but you're not lucky, and your functions are not so simple where you can do your calculations by hand, and maybe the dimensions are larger-- instead of two random variables you have 100 random variables, so it's a 100-fold integral-- then there's no way to do that in the computer. But the way that you can actually do it is by generating random samples of your random variables, doing that simulation over and over many times. That is, by interpreting an integral as a probability, you can use simulation to estimate that probability. And that gives you a way of calculating integrals.
And physicists do actually use that a lot, as well as statisticians, computer scientists, and so on. It's a so-called Monte Carlo method for evaluating integrals. And it's a basic piece of the toolbox in science these days.
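A minimal sketch of the idea in higher dimensions, assuming a hypothetical example that is not from the lecture: ten independent random variables, each uniform on [0, 1], and the event that their sum is at most 5. The probability of that event is a ten-fold integral of the indicator of the event over the unit cube. By symmetry the exact answer is 1/2, which makes it easy to see whether the simulation is behaving.

```python
import random


def estimate_probability(event, dim, n_samples, seed=0):
    """Monte Carlo estimate of P(event) for a point uniform on the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        point = [rng.random() for _ in range(dim)]  # one random sample of all dim variables
        if event(point):
            hits += 1
    return hits / n_samples  # fraction of samples that fall inside the event


dim = 10
mc_estimate = estimate_probability(lambda u: sum(u) <= dim / 2, dim, n_samples=100_000)
print(mc_estimate)
```

The same function works for any event that can be tested point by point, which is what makes the method attractive when the integral cannot be done by hand.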
Finally, the harder concept of the day is the idea of conditioning. And here things become a little subtle when you deal with continuous random variables. OK. First, remember again our basic interpretation of what a density is. A density gives us probabilities of little intervals.
So how should we define conditional densities? Conditional densities should again give us probabilities of little intervals, but inside a conditional world where we have been told something about the other random variable. So what we would like to be true is the following. We would like to define a concept of a conditional density of a random variable X given the value of another random variable Y. And it should behave the following way, that the conditional density gives us the probability of little intervals-- same as here-- given that we are told the value of y.
And here's where the subtleties come. The main thing to notice is that here I didn't write "equal," I wrote "approximately equal." Why do we need that?
Well, the thing is that conditional probabilities are not defined when you condition on an event that has 0 probability. So we need the conditioning event here to have positive probability. So instead of saying that Y is exactly equal to little y, we want to instead say we're in a new universe where capital Y is very close to little y.
And then this notion of "very close" kind of takes the limit and takes it to be infinitesimally close. So this is the way to interpret conditional probabilities. That's what they should mean.
Now, in practice, when you actually use probability, you forget about that subtlety. And you say, well, I've been told that Y is equal to 1.3. Give me the conditional distribution of X. But formally or rigorously, you should say I'm being told that Y is infinitesimally close to 1.3. Tell me the distribution of X.
Now, if this is what we want, what should this quantity be? It's a conditional probability, so it should be the probability of two things happening-- X being close to little x, Y being close to little y. And that's basically given to us by the joint density divided by the probability of the conditioning event, which has something to do with the density of Y itself. And if you do things carefully, you see that the only way to satisfy this relation is to define the conditional density by this particular formula.
OK. Big discussion to come down in the end to what you should have probably guessed by now. We just take any formulas and expressions from the discrete case and replace PMFs by PDFs. So the conditional PDF is defined by this formula where here we have joint PDF and marginal PDF, as opposed to the discrete case where we had the joint PMF and the marginal PMF. So in some sense, it's just a syntactic change.
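The requirement and the resulting definition, reconstructed from the description above. The conditioning event "Y very close to y" is modeled here as y ≤ Y ≤ y + delta, which is an assumption about how the slide states it:

```latex
\begin{align*}
P(x \le X \le x+\delta \mid y \le Y \le y+\delta)
  &= \frac{P(x \le X \le x+\delta,\; y \le Y \le y+\delta)}{P(y \le Y \le y+\delta)} \\
  &\approx \frac{f_{X,Y}(x,y)\,\delta^2}{f_Y(y)\,\delta}
   = \frac{f_{X,Y}(x,y)}{f_Y(y)}\,\delta .
\end{align*}
Matching this with the desired behavior
$P(x \le X \le x+\delta \mid Y \approx y) \approx f_{X|Y}(x \mid y)\,\delta$ gives
\[
f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \qquad \text{whenever } f_Y(y) > 0 .
\]
```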
In another sense, it's a little subtler on how you actually interpret it. Speaking about interpretation, what are some ways of thinking about the conditional density? Well, the best way to think about it is that somebody has fixed little y for you. So little y is being fixed here. And we look at this density as a function of X.
I've told you what Y is. Tell me what you know about X. And you tell me that X has a certain distribution. What does that distribution look like? It has exactly the same shape as the joint density.
Remember, we fixed Y. So this is a constant. So the only thing that varies is X. So we get the function that behaves like the joint density when you fix y, which is really you take the joint density, and you take a slice of it.
You fix a y, and you see how it varies with x. So in that sense, the conditional PDF is just a slice of the joint PDF. But we need to divide by a certain number, which just scales it without changing its shape.
We're coming back to a picture in a second. But before going to the picture, let's go back to the interpretation of independence. If the two random variables are independent, according to our definition in the previous slide, the joint density is going to factor as the product of the marginal densities. The density of Y in the numerator cancels the density in the denominator. And we're just left with the density of X.
So in the case of independence, what we get is that the conditional is the same as the marginal. And that solidifies our intuition that in the case of independence, being told something about the value of Y does not change our beliefs about how X is distributed. So whatever we expected about X is going to remain true even after we are told something about Y.
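The cancellation described above, in symbols (valid wherever the density of Y is positive):

```latex
\[
f_{X|Y}(x \mid y)
= \frac{f_{X,Y}(x,y)}{f_Y(y)}
= \frac{f_X(x)\,f_Y(y)}{f_Y(y)}
= f_X(x) .
\]
```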
So let's look at some pictures. Here is what the joint PDF might look like. Here we've got our x and y-axis. And if you want to calculate the probability of a certain event, what you do is you look at that event and you see how much of that mass is sitting on top of that event.
Now let's start slicing. Let's fix a value of x and look along that slice where we obtain this function. Now what does that slice do? That slice tells us for that particular x what the possible values of y are going to be and how likely they are.
If we integrate over all y's, what do we get? Integrating over all y's just gives us the marginal density of X. It's the calculation that we did here.
By integrating over all y's, we find the marginal density of X. So the total area under that slice gives us the marginal density of X. And by looking at the different slices, we find how likely the different values of x are going to be.
How about the conditional? If we're interested in the conditional of Y given X, how would you think about it? This refers to a universe where we are told that capital X takes on a specific value.
So we put ourselves in the universe where this line has happened. There's still possible values of y that can happen. And this shape kind of tells us the relative likelihoods of the different y's. And this is indeed going to be the shape of the conditional distribution of Y given that X has occurred.
On the other hand, the conditional distribution must add up to 1. So the total probability over all of the different y's in this universe, that total probability should be equal to 1. Here it's not equal to 1. The total area is the marginal density. To make it equal to 1, we need to divide by the marginal density, which is basically to renormalize this shape so that the total area under that slice, under that shape, is equal to 1.
So we start with the joint. We take the slices. And then we adjust the slices so that every slice has an area underneath equal to 1. And this gives us the conditional.
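The slice-then-renormalize recipe can be checked numerically. The sketch below uses a hypothetical joint density that is not the one in the lecture's picture: f(x, y) = x + y on the unit square, which integrates to 1. For that density the marginal of X is x + 1/2, so the area under the slice at x = 0.25 should be 0.75 before renormalizing and 1 after.

```python
def joint_pdf(x, y):
    """Hypothetical joint density for illustration: f(x, y) = x + y on the unit square."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def midpoints(n):
    """Midpoints of n equal subintervals of [0, 1], used for a midpoint-rule integral."""
    return [(k + 0.5) / n for k in range(n)]


def conditional_slice(x, n=1000):
    """Slice the joint at a fixed x, then renormalize the slice to have area 1."""
    ys = midpoints(n)
    slice_values = [joint_pdf(x, y) for y in ys]      # the slice: joint as a function of y
    slice_area = sum(slice_values) / n                # area under the slice = marginal f_X(x)
    conditional = [v / slice_area for v in slice_values]  # same shape, rescaled
    return ys, conditional, slice_area


x0 = 0.25
ys, conditional, slice_area = conditional_slice(x0)
conditional_area = sum(conditional) / len(conditional)
print(slice_area, conditional_area)
```

Dividing by the slice area changes only the vertical scale, so the conditional keeps the shape of the slice.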
So for example, down here-- you cannot even see it in this diagram-- but after you renormalize it so that its total area is equal to 1, you get this sort of narrow spike that goes up. And so this is a plot of the conditional distributions that you get for the different values of x. Given a particular value of x, you're going to get this certain conditional distribution.
So this picture is worth about as much as anything else in this particular chapter. Make sure you kind of understand exactly all these pieces of the picture. And finally, let's go, in the remaining time, through an example where we're going to throw in the bucket all the concepts and notations that we have introduced so far. So the example is as follows.
We start with a stick that has a certain length. And we break it at a completely random location. And-- yes, this 1 should be l. OK. So it has length l. And we're going to break it at a random place. And we call that random place where we break it, we call it X.
X can be anywhere, uniform distribution. So this means that X has a density that goes from 0 to l. I guess this capital L is supposed to be the same as the lower-case l. So that's the density of X. And since the density needs to integrate to 1, the height of that density has to be 1/l.
Now, having broken the stick and given that we are left with this piece of the stick, I'm now going to break it again at a completely random place, meaning I'm going to choose a point where I break it uniformly over the length of the stick. What does this mean? And let's call Y the location where I break it. So Y is going to range between 0 and x. x is the length of the stick that I'm left with.
So I'm going to break it somewhere in between. So I pick a y between 0 and x. And of course, x is less than l. And I'm going to break it there. So y is uniform between 0 and x.
What does that mean, that the density of y, given that you have already told me x, ranges from 0 to little x? If I told you that the first break happened at a particular x, then y can only range over this interval. And I'm assuming a uniform distribution over that interval. So we have this kind of shape. And that fixes for us the height of the conditional density.
So what's the joint density of those two random variables? By the definition of conditional densities, the conditional was defined as the ratio of the joint divided by the marginal. So we can find the joint density by taking the marginal and then multiplying by the conditional. This is the same formula as in the discrete case. This is our very familiar multiplication rule, but adjusted to the case of continuous random variables. So p's become f's.
OK. So we do have a formula for this. What is it? It's 1/l-- that's the density of X -- times 1/x, which is the conditional density of Y. This is the formula for the joint density.
But we must be careful. This is a formula that's not valid everywhere. It's only valid for the x's and y's that are possible.
And the x's and y's that are possible are given by these inequalities. So x can range from 0 to l, and y can only be smaller than x. So this is the formula for the density on this part of our space. The density is 0 anywhere else.
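Written out with its region of validity, the joint density for the stick example might look as follows. The subscripted notation is assumed. The condition x > 0 is added here only to avoid dividing by zero.

```latex
f_{X,Y}(x, y) = f_X(x)\, f_{Y \mid X}(y \mid x)
= \begin{cases}
\dfrac{1}{l} \cdot \dfrac{1}{x} = \dfrac{1}{l x}, & 0 \le y \le x \le l,\ x > 0, \\[1ex]
0, & \text{otherwise.}
\end{cases}
```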
So what does it look like? It's basically a 1/x function. So it's sort of constant along that dimension. But as x goes to 0, your density goes up and can even blow up. It sort of looks like a sail that's raised and somewhat curved and has a point up there going to infinity. So this is the joint density.
Now once you have in your hands a joint density, then you can answer in principle any problem. It's just a matter of plugging in and doing computations. How about calculating something like a conditional expectation of Y given a value of x? OK. That's a concept we have not defined so far. But how should we define it?
We do the reasonable thing. We'll define it the same way as ordinary expectations except that since we're given some conditioning information, we should use the probability distribution that applies to that particular situation. So in a situation where we are told the value of x, the distribution that applies is the conditional distribution of Y. So it's going to be the conditional density of Y given the value of x.
Now, we know what this is. It's given by 1/x. So we need to integrate y times 1/x dy.
And what should we integrate over? Well, given the value of x, y can only range from 0 to x. So this is what we get. And you do your integral, and you get that this is x/2.
Is it a surprise? It shouldn't be. This is just the expected value of Y in a universe where X has been realized and Y is given by this distribution. Y is uniform between 0 and x. The expected value of Y should be the midpoint of this interval, which is x/2.
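For reference, the conditional expectation calculation above can be written out step by step. The notation is assumed. The conditional density 1/x on the interval from 0 to x is the one stated in the example.

```latex
E[Y \mid X = x]
= \int_0^x y \, f_{Y \mid X}(y \mid x)\, dy
= \int_0^x y \cdot \frac{1}{x}\, dy
= \frac{1}{x} \cdot \frac{x^2}{2}
= \frac{x}{2}
```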
Now let's do fancier stuff. Since we have the joint distribution, we should be able to calculate the marginal. What is the distribution of Y? After breaking the stick twice, how big is the little piece that I'm left with? How do we find this?
To find the marginal, we just take the joint and integrate out the variable that we don't want. A particular y can happen in many ways. It can happen together with any x. So we consider all the possible x's that can go together with this y and average over all those x's.
So we plug in the formula for the joint density from the previous slide. We know that it's 1/(lx). And what's the range of the x's? So to find the density of Y for a particular y up here, I'm going to integrate over x's. The density is 0 here and there. The density is nonzero only in this part. So I need to integrate over x's going from here to there.
So what's the "here"? This line goes up at the slope of 1. So this is the line x equals y. So if I fix y, it means that my integral starts from a value of x that is also equal to y.
So where the integral starts from is at x equals y. And it goes all the way until the end of the length of our stick, which is l. So we need to integrate from little y up to l.
So that's something that almost always comes up. It's not enough to have just this formula for integrating the joint density. You need to keep track of different regions. And if the joint density is 0 in some regions, then you exclude those regions from the range of integration. So the range of integration is only over those values where the particular formula is valid, the places where the joint density is nonzero.
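The point about regions of integration can be checked numerically. The sketch below is illustrative only: the function names and the midpoint-rule integrator are hypothetical helpers, not part of the lecture. It integrates the joint density of the stick example over x in two ways, once over the whole range from 0 to l with the density set to 0 outside the valid region, and once over the restricted range from y to l using the formula alone.

```python
import math


def joint_density(x, y, length):
    """Joint density of the stick example: 1/(l*x) where 0 <= y <= x <= l, else 0."""
    if x > 0 and 0 <= y <= x <= length:
        return 1.0 / (length * x)
    return 0.0


def integrate_midpoint(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule (n subintervals)."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def marginal_y_full_range(y, length):
    """Integrate over every x in [0, l]; the zero region contributes nothing."""
    return integrate_midpoint(lambda x: joint_density(x, y, length), 0.0, length)


def marginal_y_restricted(y, length):
    """Integrate the formula 1/(l*x) only where it is valid, for x from y to l."""
    return integrate_midpoint(lambda x: 1.0 / (length * x), y, length)


if __name__ == "__main__":
    l, y = 1.0, 0.2
    print(marginal_y_full_range(y, l), marginal_y_restricted(y, l))
```

With l = 1 and y = 0.2, both numbers should come out close to the integral of 1/x from 0.2 to 1, which is ln 5. Applying the formula 1/(lx) over the whole range from 0 to l instead would count x values that cannot occur together with this y and give a wrong answer.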
All right. The integral of 1/x dx, that gives you a logarithm. So we evaluate this integral, and we get an expression of this kind. So the density of Y has a somewhat unexpected shape. So it's a logarithmic function.
And it goes this way. It's for y going all the way to l. When y is equal to l, the logarithm of 1 is equal to 0. But when y approaches 0, logarithm of something big blows up, and we get a shape of this form.
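The "expression of this kind" is not spelled out in the transcript. Carrying out the integral described above, with the joint density 1/(lx) and x running from y to l, gives the following (notation assumed):

```latex
f_Y(y) = \int_y^l \frac{1}{l x}\, dx
= \frac{1}{l}\bigl(\ln l - \ln y\bigr)
= \frac{1}{l} \ln\frac{l}{y}, \qquad 0 < y \le l
```

This matches the shape described: at y = l the value is (1/l) ln 1 = 0, and as y approaches 0 the ratio l/y grows without bound, so the density blows up.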
OK. Finally, we can calculate the expected value of Y. And we can do this by using the definition of the expectation. So integral of y times the density of y. We already found what that density is, so we can plug it in here. And we're integrating over the range of possible y's, from 0 to l.
Now this involves the integral of y log y, which I'm sure you have encountered in your calculus classes but maybe do not remember how to do it. In any case, you look it up in some integral tables or do it by parts. And you get the final answer of l/4.
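One way to carry out the integration by parts is sketched below. It assumes the marginal density f_Y(y) = (1/l) ln(l/y) on 0 < y ≤ l, which is what integrating the joint density 1/(lx) over x from y to l gives. The choice u = ln(l/y), dv = y dy is one option among several.

```latex
E[Y] = \int_0^l y \cdot \frac{1}{l} \ln\frac{l}{y}\, dy
= \frac{1}{l} \left( \left[ \frac{y^2}{2} \ln\frac{l}{y} \right]_0^l
  + \int_0^l \frac{y^2}{2} \cdot \frac{1}{y}\, dy \right)
= \frac{1}{l} \left( 0 + \frac{l^2}{4} \right)
= \frac{l}{4}
```

The boundary term vanishes at both ends: at y = l because ln 1 = 0, and as y goes to 0 because y² ln(l/y) goes to 0.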
And at this point, you say, that's a really simple answer. Shouldn't I have expected it to be l/4? I guess, yes. I mean, when you break it once, the expected value of what you are left with is going to be 1/2 of what you started with. When you break it the next time, the expected length of what you're left with should be 1/2 of the piece that you are now breaking.
So each time that you break it at random, you expect it to become smaller by a factor of 1/2. So if you break it twice, you are left with something that's expected to be 1/4. This is reasoning on the average, which happens to give you the right answer in this case. But again, there's the warning that reasoning on the average doesn't always give you the right answer. So be careful about doing arguments of this type.
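The l/4 answer can also be sanity-checked by simulation. The sketch below is not from the lecture: the function names, the seed, and the number of trials are arbitrary choices made for illustration. It breaks a stick of length l uniformly at X, then breaks the remaining piece uniformly at Y, and averages Y over many trials.

```python
import random


def break_stick_twice(length, rng):
    """Return (x, y): the first break point and the second break point."""
    x = rng.uniform(0.0, length)  # X is uniform on [0, l]
    y = rng.uniform(0.0, x)       # given X = x, Y is uniform on [0, x]
    return x, y


def estimate_mean_y(length, num_trials, seed=0):
    """Monte Carlo estimate of E[Y] from num_trials independent double breaks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        _, y = break_stick_twice(length, rng)
        total += y
    return total / num_trials


if __name__ == "__main__":
    l = 1.0
    print("simulated E[Y]:", estimate_mean_y(l, 200_000), "theory:", l / 4)
```

The estimate should land near l/4, up to sampling noise. A simulation like this only confirms the number; it does not show that reasoning on the average is valid in general.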
Very good. See you on Wednesday.