Lecture 8: Continuous Random Variables

Description: In this lecture, the professor discussed probability density functions, cumulative distribution functions, and normal random variables.

Instructor: John Tsitsiklis

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN TSITSIKLIS: OK. We can start. Good morning.

So we're going to start now a new unit. For the next couple of lectures, we will be talking about continuous random variables. So this is new material which is not going to be in the quiz. You are going to have a long break next week without any lecture, just a quiz and recitation and tutorial.

So what's going to happen in this new unit? Basically, we want to do everything that we did for discrete random variables, reintroduce the same sort of concepts but see how they apply and how they need to be modified in order to talk about random variables that take continuous values. At some level, it's all the same. At some level, it's quite a bit harder because when things are continuous, calculus comes in. So the calculations that you have to do on the side sometimes need a little bit more thinking.

In terms of new concepts, there's not going to be a whole lot today, some analogs of things we have done. We're going to introduce the concept of cumulative distribution functions, which allows us to deal with discrete and continuous random variables, all of them in one shot. And finally, introduce a famous kind of continuous random variable, the normal random variable.

OK, so what's the story? Continuous random variables are random variables that take values over the continuum. So the numerical value of the random variable can be any real number. They don't take values just in a discrete set.

So we have our sample space. The experiment happens. We get some omega, a sample point in the sample space. And once that point is determined, it determines the numerical value of the random variable.

Remember, random variables are functions on the sample space. You pick a sample point. This determines the numerical value of the random variable. So that numerical value is going to be some real number on that line.

Now we want to say something about the distribution of the random variable. We want to say which values are more likely than others to occur in a certain sense. For example, you may be interested in a particular event, the event that the random variable takes values in the interval from a to b. And we want to say something about the probability of that event.

In principle, how is this done? You go back to the sample space, and you find all those outcomes for which the value of the random variable happens to be in that interval. The probability that the random variable falls here is the same as the probability of all outcomes that make the random variable fall in there. So in principle, you can work on the original sample space, find the probability of this event, and you would be done.

But similar to what happened in chapter 2, we want to kind of push the sample space in the background and just work directly on the real axis and talk about probabilities up here. So we want now a way to specify probabilities, how they are bunched together, or arranged, along the real line. So what did we do for discrete random variables? We introduced PMFs, probability mass functions. And the way that we described the random variable was by saying this point has so much mass on top of it, that point has so much mass on top of it, and so on.

And so we assigned a total amount of 1 unit of probability. We assigned it to different masses, which we put at different points on the real axis. So that's what you do if somebody gives you a pound of discrete stuff, a pound of mass in little chunks. And you place those chunks at a few points.

Now, in the continuous case, this total unit of probability mass does not sit just on discrete points but is spread all over the real axis. So now we're going to have a unit of mass that spreads on top of the real axis. How do we describe masses that are continuously spread?

The way we describe them is by specifying densities. That is, how thick is the mass that's sitting here? How dense is the mass that's sitting there? So that's exactly what we're going to do. We're going to introduce the concept of a probability density function that tells us how probabilities accumulate at different parts of the real axis.

So here's an example or a picture of a possible probability density function. What does that density function kind of convey intuitively? Well, that these x's are relatively less likely to occur. Those x's are somewhat more likely to occur because the density is higher.

Now, for a more formal definition, we're going to say that a random variable X is said to be continuous if it can be described by a density function in the following sense. We have a density function. And we calculate probabilities of falling inside an interval by finding the area under the curve that sits on top of that interval.

So that's sort of the defining relation for continuous random variables. It's an implicit definition. And it tells us a random variable is continuous if we can calculate probabilities this way. So the probability of falling in this interval is the area under this curve. Mathematically, it's the integral of the density over this particular interval. If the density happens to be constant over that interval, the area under the curve would be the length of the interval times the height of the density, which sort of makes sense.

Now, because the density is not constant but it kind of moves around, what you need is to write down an integral. Now, this formula is very much analogous to what you would do for discrete random variables. For a discrete random variable, how do you calculate this probability?

You look at all x's in this interval. And you add the probability mass function over that range. So just for comparison, this would be the formula for the discrete case-- the sum over all x's in the interval from a to b of the probability mass function. And there is a syntactic analogy that's happening here and which will be a persistent theme when we deal with continuous random variables.

Sums get replaced by integrals. In the discrete case, you add. In the continuous case, you integrate. Mass functions get replaced by density functions. So you can take pretty much any formula from the discrete case and translate it to a continuous analog of that formula, as we're going to see.

OK. So let's take this now as our model. What is the probability that the random variable takes a specific value if we have a continuous random variable? Well, this would be the case. It's a case of a trivial interval, where the two end points coincide. So it would be the integral from a to itself. So you're integrating just over a single point.

Now, when you integrate over a single point, the integral is just 0. The area under the curve, if you're only looking at a single point, it's 0. So big property of continuous random variables is that any individual point has 0 probability.

In particular, when you look at the value of the density, the density does not tell you the probability of that point. The point itself has 0 probability. So the density tells you something a little different. We are going to see shortly what that is.
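
As a rough illustration of the sum-versus-integral analogy and of the zero probability of single points, here is a minimal Python sketch. The fair die and the density 2x on [0, 1] are hypothetical examples, not ones used in the lecture, and the integral is approximated by a midpoint Riemann sum.

```python
def interval_prob_discrete(pmf, a, b):
    """P(a <= X <= b) for a discrete X: add the PMF over the values in [a, b]."""
    return sum(p for x, p in pmf.items() if a <= x <= b)


def interval_prob_continuous(pdf, a, b, n=100_000):
    """P(a <= X <= b) for a continuous X: integrate the PDF over [a, b].

    Approximates the integral with a midpoint Riemann sum of n slices.
    """
    width = (b - a) / n
    return sum(pdf(a + (i + 0.5) * width) for i in range(n)) * width


# Hypothetical discrete example: a fair six-sided die.
die_pmf = {k: 1 / 6 for k in range(1, 7)}


# Hypothetical continuous example: density 2x on [0, 1], zero elsewhere.
def triangle_pdf(x):
    return 2 * x if 0 <= x <= 1 else 0.0


print(interval_prob_discrete(die_pmf, 2, 4))             # sum of three masses
print(interval_prob_continuous(triangle_pdf, 0.0, 0.5))  # area under 2x on [0, 0.5]
print(interval_prob_continuous(triangle_pdf, 0.3, 0.3))  # single point: zero width, zero area
```

The last call integrates over an interval of zero width, so it returns 0 even though the density at 0.3 is 0.6.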

Before we get there, can the density be an arbitrary function? Almost, but not quite. There are two things that we want. First, since densities are used to calculate probabilities, and since probabilities must be non-negative, the density should also be non-negative. Otherwise you would be getting negative probabilities, which is not a good thing. So that's a basic property that any density function should obey.

The second property that we need is that the overall probability of the entire real line should be equal to 1. So if you ask me, what is the probability that x falls between minus infinity and plus infinity, well, we are sure that x is going to fall in that range. So the probability of that event should be 1. So the probability of being between minus infinity and plus infinity should be 1, which means that the integral from minus infinity to plus infinity should be 1. So that just tells us that there's 1 unit of total probability that's being spread over our space.

Now, what's the best way to think intuitively about what the density function does? The interpretation that I find most natural, and that most easily conveys the meaning of a density, is to look at probabilities of small intervals. So let us take an x somewhere here and then x plus delta just next to it. So delta is a small number.

And let's look at the probability of the event that we get a value in that range. For continuous random variables, the way we find the probability of falling in that range is by integrating the density over that range. So we're drawing this picture. And we want to take the area under this curve.

Now, what happens if delta is a fairly small number? If delta is pretty small, our density is not going to change much over that range. So you can pretend that the density is approximately constant. And so to find the area under the curve, you just take the base times the height.

And it doesn't matter where exactly you take the height in that interval, because the density doesn't change very much over that interval. And so the integral becomes just base times the height. So for small intervals, the probability of a small interval is approximately the density times delta. So densities essentially give us probabilities of small intervals.

And if you want to think about it a little differently, you can take that delta from here and send it to the denominator there. And what this tells you is that the density is probability per unit length for intervals of small length. So the units of density are probability per unit length.
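
The small-interval approximation can be checked numerically. The sketch below assumes a hypothetical density f(x) = 2x on [0, 1], for which the exact interval probability is available by hand, and compares it with density times delta as delta shrinks.

```python
def pdf(x):
    # Hypothetical density for illustration: f(x) = 2x on [0, 1], zero elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0


def exact_prob(x, delta):
    # Exact P(x <= X <= x + delta) for this density, valid when
    # 0 <= x and x + delta <= 1: the integral of 2t from x to x + delta.
    return (x + delta) ** 2 - x ** 2


x = 0.5
for delta in (0.1, 0.01, 0.001):
    approx = pdf(x) * delta       # density times length of the interval
    exact = exact_prob(x, delta)  # true probability of the interval
    # exact / delta is "probability per unit length"; it approaches pdf(x).
    print(delta, exact, approx, exact / delta)
```

For this density the ratio exact / delta equals 2x + delta, so it tends to f(x) = 2x as delta goes to 0.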

Densities are not probabilities. They are rates at which probabilities accumulate, probabilities per unit length. And since densities are not probabilities, they don't have to be less than 1.

Ordinary probabilities can never exceed 1. But density is a different kind of thing. It can get pretty big in some places. It can even sort of blow up in some places. As long as the total area under the curve is 1, other than that, the curve can do anything that it wants.
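
A small sketch of the two requirements on a density, using a hypothetical example whose height is larger than 1: a constant density of 2 on the interval [0, 0.5]. The integration helper is an assumed midpoint Riemann sum, not something from the lecture.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def narrow_uniform_pdf(x):
    # Hypothetical density: height 2 on [0, 0.5], zero elsewhere.
    return 2.0 if 0 <= x <= 0.5 else 0.0


# Requirement 1: the density is never negative (checked on a grid over [-1, 2]).
grid = [-1 + 3 * i / 1000 for i in range(1001)]
is_nonnegative = all(narrow_uniform_pdf(x) >= 0 for x in grid)

# Requirement 2: the total area is 1. The density is zero outside [0, 0.5],
# so integrating over that interval covers the whole real line.
total_area = integrate(narrow_uniform_pdf, 0.0, 0.5)

print(is_nonnegative, total_area, narrow_uniform_pdf(0.25))
```

The density value 2 is a rate, not a probability: the probability of any interval inside [0, 0.5] is 2 times its length, which never exceeds 1.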

Now, the density prescribes for us the probability of intervals. Sometimes we may want to find the probability of more general sets. How would we do that? Well, for nice sets, you will just integrate the density over that nice set.

I'm not quite defining what "nice" means. That's a pretty technical topic in the theory of probability. But for our purposes, usually we will take B to be something like a union of intervals.

So how do you find the probability of falling in the union of two intervals? Well, you find the probability of falling in that interval plus the probability of falling in that interval. So it's the integral over this interval plus the integral over that interval.

And you think of this as just integrating over the union of the two intervals. So once you can calculate probabilities of intervals, then usually you are in business, and you can calculate anything else you might want. So the probability density function is a complete description of any statistical information we might be interested in for a continuous random variable.

OK. So now we can start walking through the concepts and the definitions that we have for discrete random variables and translate them to the continuous case. The first big concept is the concept of the expectation. One can start with a mathematical definition.

And here we put down a definition by just translating notation. Wherever we have a sum in the discrete case, we now write an integral. And wherever we had the probability mass function, we now throw in the probability density function.

This formula-- you may have seen it in freshman physics-- basically, it again gives you the center of gravity of the picture that you have when you have the density. It's the center of gravity of the object sitting underneath the probability density function. So that interpretation still applies.

It's also true that our conceptual interpretation of what an expectation means is also valid in this case. That is, if you repeat an experiment a zillion times, each time drawing an independent sample of your random variable x, in the long run, the average that you are going to get should be the expectation. One can reason in a hand-waving way, sort of intuitively, the way we did it for the case of discrete random variables. But this is also a theorem of some sort. It's a limit theorem that we're going to visit later on in this class.

Having defined the expectation and having claimed that the interpretation of the expectation is the same as before, then we can start taking just any formula you've seen before and just translate it. So for example, to find the expected value of a function of a continuous random variable, you do not have to find the PDF or PMF of g(X). You can just work directly with the original distribution of the random variable capital X.

And this formula is the same as for the discrete case. Sums get replaced by integrals. And PMFs get replaced by PDFs. And in particular, the variance of a random variable is defined again the same way.

The variance is the expected value, the average of the distance of X from the mean and then squared. So it's the expected value for a random variable that takes these numerical values. And same formula as before, with an integral and f instead of the summation and the p.

And the formulas that we have derived or formulas that you have seen for the discrete case, they all go through the continuous case. So for example, the useful relation for variances, which is this one, remains true. All right. So time for an example.
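
The formulas referred to in this passage are shown on the lecture slides rather than spoken. In standard notation, and assuming the usual symbol f_X for the density of X, they would read roughly as follows: the expectation, the expected value rule for a function g, the variance, and the useful relation for variances.

```latex
\begin{align*}
E[X] &= \int_{-\infty}^{\infty} x \, f_X(x) \, dx \\
E[g(X)] &= \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx \\
\operatorname{var}(X) &= E\big[(X - E[X])^2\big]
  = \int_{-\infty}^{\infty} (x - E[X])^2 \, f_X(x) \, dx \\
\operatorname{var}(X) &= E[X^2] - (E[X])^2
\end{align*}
```

Each line is the discrete formula with the sum replaced by an integral and the PMF replaced by the PDF.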

The most simple example of a continuous random variable that there is, is the so-called uniform random variable. So the uniform random variable is described by a density which is 0 except over an interval. And over that interval, it is constant. What is it meant to convey? It's trying to convey the idea that all x's in this range are equally likely.

Well, that doesn't say very much. Any individual x has 0 probability. So it's conveying a little more than that. What it is saying is that if I take an interval of a given length delta, and I take another interval of the same length, delta, under the uniform distribution, these two intervals are going to have the same probability.

So being uniform means that intervals of same length have the same probability. So no interval is more likely than any other to occur. And in that sense, it conveys the idea of sort of complete randomness. Any little interval in our range is equally likely as any other little interval.

All right. So what's the formula for this density? I only told you the range. What's the height? Well, the area under the density must be equal to 1. Total probability is equal to 1. And so the height, inescapably, is going to be 1 over (b minus a). That's the height that makes the density integrate to 1. So that's the formula.

And if you don't want to lose one point in your exam, you have to say that it's also 0, otherwise. OK. All right? That's sort of the complete answer.

How about the expected value of this random variable? OK. You can find the expected value in two different ways. One is to start with the definition. And so you integrate over the range of interest times the density. And you figure out what that integral is going to be.

Or you can be a little more clever. Since the center-of-gravity interpretation is still true, it must be the center of gravity of this picture. And the center of gravity is, of course, the midpoint. Whenever you have symmetry, the mean is always the midpoint of the diagram that gives you the PDF. OK. So that's the expected value of X.

Finally, regarding the variance, well, there you will have to do a little bit of calculus. We can write down the definition. So it's an integral instead of a sum. A typical value of the random variable minus the expected value, squared, times the density. And we integrate. You do this integral, and you find it's (b minus a) squared over that number, which happens to be 12.

Maybe more interesting is the standard deviation itself. And you see that the standard deviation is proportional to the width of that interval. This agrees with our intuition, that the standard deviation is meant to capture a sense of how spread out our distribution is. And the standard deviation has the same units as the random variable itself. So it's sort of good to-- you can interpret it in a reasonable way based on that picture.
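
The uniform results can be checked numerically. This is a minimal sketch with hypothetical endpoints a = 2 and b = 5 and an assumed midpoint Riemann sum for the integrals; it compares the numbers with the midpoint (a + b) / 2, with (b - a) squared over 12, and with a standard deviation proportional to the width b - a.

```python
import math


def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


# Hypothetical endpoints of the uniform range.
a, b = 2.0, 5.0
height = 1.0 / (b - a)  # the height that makes the total area equal to 1


def uniform_pdf(x):
    return height if a <= x <= b else 0.0  # and zero otherwise


# Integrals over [a, b] suffice because the density is zero elsewhere.
mean = integrate(lambda x: x * uniform_pdf(x), a, b)
variance = integrate(lambda x: (x - mean) ** 2 * uniform_pdf(x), a, b)
std_dev = math.sqrt(variance)

print(mean, (a + b) / 2)                 # midpoint of the interval
print(variance, (b - a) ** 2 / 12)       # (b - a)^2 / 12
print(std_dev, (b - a) / math.sqrt(12))  # proportional to the width b - a
```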

OK, yes. Now, let's go up one level and think about the following. So we have formulas for the discrete case, formulas for the continuous case. So you can write them side by side. One has sums, the other has integrals.

Suppose you want to make an argument and say that something is true for every random variable. You would essentially need to do two separate proofs, for discrete and for continuous. Is there some way of dealing with random variables just one at a time, in one shot, using a sort of uniform notation? Is there a unifying concept?

Luckily, there is one. It's the notion of the cumulative distribution function of a random variable. And it's a concept that applies equally well to discrete and continuous random variables. So it's an object that we can use to describe distributions in both cases, using just one piece of notation.

So what's the definition? It's the probability that the random variable takes values less than a certain number little x. So you go to the diagram, and you see what's the probability that I'm falling to the left of this. And you specify those probabilities for all x's.

In the continuous case, you calculate those probabilities using the integral formula. So you integrate from here up to x. In the discrete case, to find the probability to the left of some point, you go here, and you add probabilities again from the left.

So the way that the cumulative distribution function is calculated is a little different in the continuous and discrete case. In one case you integrate. In the other, you sum. But leaving aside how it's being calculated, what the concept is, it's the same concept in both cases. So let's see what the shape of the cumulative distribution function would be in the two cases.

So here what we want is to record for every little x the probability of falling to the left of x. So let's start here. Probability of falling to the left of here is 0-- 0, 0, 0. Once we get here and we start moving to the right, the probability of falling to the left of here is the area of this little rectangle. And the area of that little rectangle increases linearly as I keep moving. So accordingly, the CDF increases linearly until I get to that point.

At that point, what's the value of my CDF? 1. I have accumulated all the probability there is. I have integrated it.

This total area has to be equal to 1. So it reaches 1, and then there's no more probability to be accumulated. It just stays at 1. So the value here is equal to 1.

OK. How would you find the density if somebody gave you the CDF? The CDF is the integral of the density. Therefore, the density is the derivative of the CDF. So you look at this picture and take the derivative.

Derivative is 0 here, 0 here. And it's a constant up there, which corresponds to that constant. So more generally, and an important thing to know, is that the derivative of the CDF is equal to the density-- almost, with a little bit of an exception.

What's the exception? At those places where the CDF does not have a derivative-- here where it has a corner-- the derivative is undefined. And in some sense, the density is also ambiguous at that point. Is my density at the endpoint, is it 0 or is it 1?

It doesn't really matter. If you change the density at just a single point, it's not going to affect the value of any integral you ever calculate. So the value of the density at the endpoint, you can leave it as being ambiguous, or you can specify it.

It doesn't matter. So at all places where the CDF has a derivative, this will be true. At those places where you have corners, which do show up sometimes, well, you don't really care.
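
The relation between the CDF and the density can be sketched for the uniform case. The endpoints 2 and 5 are hypothetical, and the derivative is approximated with a central difference, so this is a numerical illustration rather than a proof.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for X uniform on [a, b]: 0, then a linear ramp, then 1."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def numerical_derivative(F, x, h=1e-6):
    """Central-difference approximation of the derivative of F at x."""
    return (F(x + h) - F(x - h)) / (2 * h)


# Hypothetical endpoints.
a, b = 2.0, 5.0


def F(x):
    return uniform_cdf(x, a, b)


print(numerical_derivative(F, 1.0))  # left of the range: slope 0
print(numerical_derivative(F, 3.0))  # inside the range: slope 1 / (b - a)
print(numerical_derivative(F, 6.0))  # right of the range: slope 0
```

At the corners x = a and x = b the one-sided slopes disagree, which matches the remark that the density is ambiguous there and that the ambiguity does not affect any integral.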

How about the discrete case? In the discrete case, the CDF has a more peculiar shape. So let's do the calculation. We want to find the probability of being to the left of here. That probability is 0, 0, 0. Once we cross that point, the probability of being to the left of here is 1/6. So as soon as we cross the point 1, we get the probability of 1/6, which means that the size of the jump that we have here is 1/6.

Now, question. At this point 1, which is the correct value of the CDF? Is it 0, or is it 1/6? It's 1/6 because-- you need to look carefully at the definitions, the probability of capital X being less than or equal to little x.

If I take little x to be 1, it's the probability that capital X is less than or equal to 1. So it includes the event that capital X is equal to 1. So it includes this probability here. So at jump points, the correct value of the CDF is going to be this one.

And now as I trace, x is going to the right. As soon as I cross this point, I have added another 3/6 probability. So that 3/6 causes a jump to the CDF. And that determines the new value. And finally, once I cross the last point, I get another jump of 2/6.

A general moral from these two examples and these pictures. CDFs are well defined in both cases. For the case of continuous random variables, the CDF will be a continuous function. It starts from 0. It eventually goes to 1 and goes smoothly-- well, continuously from smaller to higher values. It can only go up. It cannot go down since we're accumulating more and more probability as we are going to the right.

In the discrete case, again it starts from 0, and it goes to 1. But it does it in a staircase manner. And you get a jump at each place where the PMF assigns a positive mass. So jumps in the CDF are associated with point masses in our distribution. In the continuous case, we don't have any point masses, so we do not have any jumps either.
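
A staircase CDF of this kind can be sketched in a few lines. The masses 1/6, 3/6, and 2/6 and the first location, 1, come from the example above; the other two locations (2 and 3) are assumptions made only for illustration, since the transcript does not state them.

```python
from fractions import Fraction

# Masses from the example; the locations 2 and 3 are hypothetical.
pmf = {1: Fraction(1, 6), 2: Fraction(3, 6), 3: Fraction(2, 6)}


def discrete_cdf(x):
    """P(X <= x): add every mass located at or to the left of x."""
    return sum(p for value, p in pmf.items() if value <= x)


print(discrete_cdf(0.999))  # just left of the first mass: nothing accumulated
print(discrete_cdf(1))      # at the jump point the mass at 1 is included
print(discrete_cdf(2))      # after the second jump
print(discrete_cdf(3))      # all the probability has been accumulated
```

Because the definition uses "less than or equal to", the value at a jump point is the upper one.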

Now, besides saving us notation-- we don't have to deal with discrete and continuous twice-- CDFs give us actually a little more flexibility. Not all random variables are continuous or discrete. You can cook up random variables that are kind of neither or a mixture of the two.

An example would be, let's say you play a game. And with a certain probability, you get a certain number of dollars in your hands. So you flip a coin. And with probability 1/2, you get a reward of 1/2 dollars.

And with probability 1/2, you are led to a dark room where you spin a wheel of fortune. And that wheel of fortune gives you a random reward between 0 and 1. So any of these outcomes is possible. And the amount that you're going to get, let's say, is uniform.

So you flip a coin. And depending on the outcome of the coin, either you get a certain value or you get a value that ranges over a continuous interval. So what kind of random variable is it?

Is it continuous? Well, continuous random variables assign 0 probability to individual points. Is it the case here? No, because you have positive probability of obtaining 1/2 dollar. So our random variable is not continuous.

Is it discrete? It's not discrete, because our random variable can take values also over a continuous range. So we call such a random variable a mixed random variable.

If you were to draw its distribution very loosely, probably you would want to draw a picture like this one, which kind of conveys the idea of what's going on. So just think of this as a drawing of masses that are sitting over a table. We place an object that weighs half a pound, but it's an object that takes zero space. So half a pound is just sitting on top of that point. And we take another half-pound of probability and spread it uniformly over that interval.

So this is like a piece that comes from mass functions. And that's a piece that looks more like a density function. And we just throw them together in the picture. I'm not trying to associate any formal meaning with this picture. It's just a schematic of how probabilities are distributed, help us visualize what's going on.

Now, if you have taken classes on systems and all of that, you may have seen the concept of an impulse function. And you may start saying that, oh, I should treat this mathematically as a so-called impulse function. But we do not need this for our purposes in this class. Just think of this as a nice picture that conveys what's going on in this particular case.

So now, what would the CDF look like in this case? The CDF is always well defined, no matter what kind of random variable you have. So the fact that it's not continuous, it's not discrete shouldn't be a problem as long as we can calculate probabilities of this kind.

So the probability of falling to the left here is 0. Once I start crossing there, the probability of falling to the left of a point increases linearly with how far I have gone. So we get this linear increase. But as soon as I cross that point, I accumulate another 1/2 unit of probability instantly. And once I accumulate that 1/2 unit, it means that my CDF is going to have a jump of 1/2.

And then afterwards, I still keep accumulating probability at a fixed rate, the rate being the density. And I keep accumulating, again, at a linear rate until I settle to 1. So this is a CDF that has certain pieces where it increases continuously. And that corresponds to the continuous part of our random variable. And it also has some places where it has discrete jumps. And those discrete jumps correspond to places in which we have placed a positive mass.

And by the-- OK, yeah. So this little 0 shouldn't be there. So let's cross it out. All right.
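
The CDF of the mixed random variable in this game can be written down directly: half a unit of probability sits at the point 1/2, and the other half is spread uniformly over the interval from 0 to 1. The function name below is an assumption for illustration.

```python
def mixed_cdf(x):
    """P(X <= x) for the coin-flip game.

    With probability 1/2 the reward is exactly 1/2 (a point mass);
    with probability 1/2 it is uniform on [0, 1].
    """
    # Uniform part: density 1/2 on [0, 1], so it accumulates linearly.
    continuous_part = 0.5 * min(max(x, 0.0), 1.0)
    # Point mass: contributes all at once as soon as x reaches 1/2.
    point_mass_part = 0.5 if x >= 0.5 else 0.0
    return continuous_part + point_mass_part


for x in (-1.0, 0.25, 0.499999, 0.5, 0.75, 1.0, 2.0):
    print(x, mixed_cdf(x))
```

The printed values rise linearly with slope 1/2, jump by 1/2 at x = 1/2, continue with slope 1/2, and settle at 1.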

So finally, we're going to take the remaining time and introduce our new friend. It's going to be the Gaussian or normal distribution. So it's the most important distribution there is in all of probability theory. It plays a very central role. It shows up all over the place. We'll see later in the class in more detail why it shows up.

But the quick preview is the following. If you have a phenomenon in which you measure a certain quantity, but that quantity is made up of lots and lots of random contributions-- so your random variable is actually the sum of lots and lots of independent little random variables-- then invariably, no matter what kind of distribution the little random variables have, their sum will turn out to have approximately a normal distribution. So this makes the normal distribution arise very naturally in lots and lots of contexts. Whenever you have noise that's comprised of lots of different independent pieces of noise, then the end result will be a random variable that's approximately normal.

So we are going to come back to that topic later. But that's the preview comment, basically to argue that it's an important one. OK. And there's a special case. If you are dealing with a binomial distribution, which is the sum of lots of Bernoulli random variables, again you would expect that the binomial would start looking like a normal if you have many, many-- a large number of coin flips.

All right. So what's the math involved here? Let's parse the formula for the density of the normal. What we start with is the function X squared over 2. And if you are to plot X squared over 2, it's a parabola, and it has this shape -- X squared over 2.

Then what do we do? We take the negative exponential of this. So when X squared over 2 is 0, then negative exponential is 1. When X squared over 2 increases, the negative exponential of that falls off, and it falls off pretty fast.

So as this goes up, the formula for the density goes down. And because exponentials are pretty strong in how quickly they fall off, this means that the tails of this distribution actually do go down pretty fast. OK. So that explains the shape of the normal PDF.

How about this factor 1 over square root 2 pi? Where does this come from? Well, the integral has to be equal to 1. So you have to go and do your calculus exercise and find the integral of this e to the minus X squared over 2 function and then figure out, what constant do I need to put in front so that the integral is equal to 1?

How do you evaluate that integral? Either you go to Mathematica or Wolfram Alpha or whatever, and it tells you what it is. Or it's a very beautiful calculus exercise that you may have seen at some point. You throw in another exponential of this kind, you bring in polar coordinates, and somehow the answer comes beautifully out there.

But in any case, this is the constant that you need to make it integrate to 1 and to be a legitimate density. We call this the standard normal. And for the standard normal, what is the expected value? Well, by symmetry, it's equal to 0.

What is the variance? Well, here there's no shortcut. You have to do another calculus exercise. And you find that the variance is equal to 1. OK. So this is a normal that's centered around 0.
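
The three facts about the standard normal (total area 1, mean 0, variance 1) can be checked numerically. This sketch assumes a midpoint Riemann sum and truncates the real line to [-10, 10], outside of which the density is negligibly small.

```python
import math


def standard_normal_pdf(x):
    """Density of the standard normal: exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def integrate(f, a, b, n=200_000):
    """Midpoint Riemann sum approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


# Truncation of the real line; the tails beyond +-10 contribute almost nothing.
lo, hi = -10.0, 10.0

total = integrate(standard_normal_pdf, lo, hi)
mean = integrate(lambda x: x * standard_normal_pdf(x), lo, hi)
variance = integrate(lambda x: (x - mean) ** 2 * standard_normal_pdf(x), lo, hi)

print(total)     # close to 1: the constant in front normalizes the density
print(mean)      # close to 0, by symmetry
print(variance)  # close to 1
```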

How about other types of normals that are centered at different places? So we can do the same kind of thing. Instead of centering it at 0, we can take some place where we want to center it, write down a quadratic such as (X minus mu) squared, and then take the negative exponential of that. And that gives us a normal density that's centered at mu.

Now, I may wish to control the width of my density. To control the width of my density, equivalently I can control the width of my parabola. If my parabola is narrower, if my parabola looks like this, what's going to happen to the density? It's going to fall off much faster.

OK. How do I make my parabola narrower or wider? I do it by putting in a constant down here. So by putting a sigma here, this stretches or widens my parabola by a factor of sigma. Let's see. Which way does it go?

If sigma is very small, this is a big number. My parabola goes up quickly, which means my normal falls off very fast. So small sigma corresponds to a narrower density.

And so it, therefore, should be intuitive that the standard deviation is proportional to sigma. Because that's the amount by which you are scaling the picture. And indeed, the standard deviation is sigma. And so the variance is sigma squared.

So all that we have done here to create a general normal with a given mean and variance is to take this picture, shift it in space so that the mean sits at mu instead of 0, and then scale it by a factor of sigma. This gives us a normal with a given mean and a given variance. And the formula for it is this one.
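
The formula itself is on the slide rather than in the transcript. In standard notation, assuming f_X for the density, the standard normal and the general normal with mean mu and variance sigma squared would be written as:

```latex
\begin{align*}
\text{standard normal:}\quad
  f_X(x) &= \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \\
\text{general normal:}\quad
  f_X(x) &= \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}
\end{align*}
```

The extra factor of sigma in the denominator of the constant keeps the total area equal to 1 after the picture is stretched by sigma.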

All right. Now, normal random variables have some wonderful properties. And one of them is that they behave nicely when you take linear functions of them. So let's fix some constants a and b, suppose that X is normal, and look at this linear function Y.

What is the expected value of Y? Here we don't need anything special. We know that the expected value of a linear function is the linear function of the expectation. So the expected value is this.

How about the variance? We know that the variance of a linear function doesn't care about the constant term. But the variance gets multiplied by a squared. So we get this variance, where sigma squared is the variance of the original normal.

So have we used so far the property that X is normal? No, we haven't. This calculation here is true in general when you take a linear function of a random variable. But if X is normal, we get the other additional fact that Y is also going to be normal. So that's the nontrivial part of the fact that I'm claiming here. So linear functions of normal random variables are themselves normal.
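
A simulation can make this plausible, although it proves nothing. The sketch below uses hypothetical values for mu, sigma, a, and b, draws samples of X with the standard library's `random.gauss`, forms Y = aX + b, and compares the sample mean and variance with a times mu plus b and a squared times sigma squared. As a crude check of normality it also compares the fraction of samples below one standard deviation above the mean with the standard normal CDF at 1, computed from `math.erf`.

```python
import math
import random


def standard_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical parameters chosen only for illustration.
mu, sigma = 1.0, 2.0  # X is normal with mean mu and standard deviation sigma
a, b = 3.0, -4.0      # Y = a * X + b

rng = random.Random(0)  # fixed seed so the run is repeatable
ys = [a * rng.gauss(mu, sigma) + b for _ in range(100_000)]

predicted_mean = a * mu + b          # linear function of the expectation
predicted_var = a ** 2 * sigma ** 2  # constant term drops out, a gets squared

sample_mean = sum(ys) / len(ys)
sample_var = sum((y - sample_mean) ** 2 for y in ys) / len(ys)

# If Y is normal, this fraction should be close to the standard normal CDF at 1.
cutoff = predicted_mean + math.sqrt(predicted_var)
fraction_below = sum(y <= cutoff for y in ys) / len(ys)

print(sample_mean, predicted_mean)
print(sample_var, predicted_var)
print(fraction_below, standard_normal_cdf(1.0))
```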

How do we convince ourselves about it? OK. It's something that we will do formally in about two or three lectures from today. So we're going to prove it. But if you think about it intuitively, normal means this particular bell-shaped curve. And that bell-shaped curve could be sitting anywhere and could be scaled in any way.

So you start with a bell-shaped curve. If you take X, which is bell shaped, and you multiply it by a constant, what does that do? Multiplying by a constant is just like scaling the axis or changing the units with which you're measuring it. So it will take a bell shape and spread it or narrow it. But it will still be a bell shape. And then when you add the constant, you just take that bell and move it elsewhere.

So under linear transformations, bell shapes will remain bell shapes, just sitting at a different place and with a different width. And that's sort of the intuition of why normals remain normals under this kind of transformation. So why is this useful?

Well, OK. We have a formula for the density. But usually we want to calculate probabilities. How will you calculate probabilities? If I ask you, what's the probability that the normal is less than 3, how do you find it? You need to integrate the density from minus infinity up to 3.

Unfortunately, the integral of the expression that shows up that you would have to calculate, an integral of this kind from, let's say, minus infinity to some number, is something that's not known in closed form. So if you're looking for a closed-form formula for this-- X bar-- if you're looking for a closed-form formula that gives you the value of this integral as a function of X bar, you're not going to find it. So what can we do?

Well, since it's a useful integral, we can just tabulate it. Calculate it once and for all, for all values of X bar up to some precision, and have that table, and use it. That's what one does.

OK, but now there is a catch. Are we going to write down a table for every conceivable type of normal distribution-- that is, for every possible mean and every variance? I guess that would be a pretty long table. You don't want to do that.

Fortunately, it's enough to have a table with the numerical values only for the standard normal. And once you have those, you can use them in a clever way to calculate probabilities for the more general case. So let's see how this is done.

So our starting point is that someone has graciously calculated for us the values of the CDF, the cumulative distribution function, that is the probability of falling below a certain point for the standard normal and at various places. How do we read this table? The probability that X is less than, let's say, 0.63 is this number. This number, 0.7357, is the probability that the standard normal is below 0.63. So the table refers to the standard normal.
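
For readers without the printed table at hand, a few entries of this kind can be reproduced with a short sketch. It relies on the standard identity relating the standard normal CDF to the error function, which Python exposes as `math.erf`; the function name `phi` and the particular rows printed are choices made here for illustration.

```python
import math

def phi(z):
    """CDF of the standard normal, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A few rows of the kind of table described: z, then P(Z <= z) to 4 places.
for z in (0.00, 0.25, 0.63, 1.00, 2.00):
    print(f"{z:4.2f}  {phi(z):.4f}")
```

The row for 0.63 should agree with the value 0.7357 read off the table above.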

But someone, let's say, gives us some other numbers and tells us we're dealing with a normal with a certain mean and a certain variance. And we want to calculate the probability that the value of that random variable is less than or equal to 3. How are we going to do it? Well, there's a standard trick, which is so-called standardizing a random variable.

Standardizing a random variable stands for the following. You look at the random variable, and you subtract the mean. This makes it a random variable with 0 mean. And then if I divide by the standard deviation, what happens to the variance of this random variable? Dividing by sigma divides the variance by sigma squared.

The original variance of X was sigma squared. So when I divide by sigma, I end up with unit variance. So after I do this transformation, I get a random variable that has 0 mean and unit variance.

It is also normal. Why is it normal? Because this expression is a linear function of the X that I started with. It's a linear function of a normal random variable. Therefore, it is normal. And it is a standard normal.

So by taking a general normal random variable and doing this standardization, you end up with a standard normal to which you can then apply the table. Sometimes one calls this the normalized score. If you're thinking about test results, how would you interpret this number? It tells you how many standard deviations are you away from the mean.

This is how much you are away from the mean. And you count it in terms of how many standard deviations it is. So this number being equal to 3 tells you that X happens to be 3 standard deviations above the mean. And I guess if you're looking at your quiz scores, very often that's the kind of number that you think about. So it's a useful quantity.
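
As a minimal sketch of the normalized score, the function below subtracts the mean and divides by the standard deviation. The quiz mean of 70 and standard deviation of 10 are made-up numbers for illustration only; they do not come from the lecture.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical quiz: mean 70, standard deviation 10.
print(z_score(100, 70, 10))  # 3.0: three standard deviations above the mean
print(z_score(55, 70, 10))   # -1.5: one and a half standard deviations below
```

A negative score simply means the value sits below the mean.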

But it's also useful for doing the calculation we're now going to do. So suppose that X has a mean of 2 and a variance of 16, so a standard deviation of 4. And we're going to calculate the probability of this event. This event is described in terms of this X that has ugly means and variances. But we can take this event and rewrite it as an equivalent event.

X less than 3 is the same as X minus 2 being less than 3 minus 2, which is the same as this ratio being less than that ratio. So I'm subtracting from both sides of the inequality the mean and then dividing by the standard deviation. This event is the same as that event.

Why do we like this better than that? We like it because this is the standardized, or normalized, version of X. We know that this is standard normal. And so we're asking the question, what's the probability that the standard normal is less than this number, which is 1/4? So that's the key property, that this is normal (0, 1).

And so we can look up now with the table and ask for the probability that the standard normal random variable is less than 0.25. Where is that going to be? 0.2, 0.25, it's here. So the answer is 0.5987.
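
The whole calculation can be sketched in a few lines: standardize, then evaluate the standard normal CDF at the resulting point. Here the table lookup is replaced by `math.erf`, and the cross-check uses `statistics.NormalDist` from the Python standard library; both are substitutions made for this illustration, not the method used in the lecture.

```python
import math
from statistics import NormalDist

def phi(z):
    """CDF of the standard normal, standing in for the printed table."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 2.0, 4.0        # mean 2, variance 16
x = 3.0

z = (x - mu) / sigma        # standardize: (3 - 2) / 4 = 0.25
p = phi(z)                  # P(X <= 3) = P(Z <= 0.25)

# Cross-check against the CDF of the general normal computed directly.
p_direct = NormalDist(mu, sigma).cdf(x)

print(z, round(p, 4), round(p_direct, 4))
```

Both routes should give a probability of about 0.5987, a little above one half, which makes sense because 3 is only slightly above the mean of 2.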

So I guess this is just a drill that you could learn in high school. You didn't have to come here to learn about it. But it's a drill that's very useful when we will be calculating normal probabilities all the time. So make sure you know how to use the table and how to massage a general normal random variable into a standard normal random variable.

OK. So just one more minute to look at the big picture and take stock of what we have done so far and where we're going. Chapter 2 was this part of the picture, where we dealt with discrete random variables. And this time, today, we started talking about continuous random variables. And we introduced the density function, which is the analog of the probability mass function.

We have the concepts of expectation and variance and CDF. And this kind of notation applies to both discrete and continuous cases. They are calculated the same way in both cases except that in the continuous case, you use integrals. In the discrete case, you use sums.

So on that side, you have integrals. In this case, you have sums. In this case, you always have Fs in your formulas. In this case, you always have Ps in your formulas.
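
To make the parallel concrete, here is a small sketch computing an expectation both ways: a sum against a PMF in the discrete case, and an integral against a density in the continuous case. The fair die, the uniform density on [2, 6], and the midpoint integrator are all examples chosen here for illustration.

```python
# Discrete case: expectation is a sum over the PMF (a fair six-sided die).
pmf = {k: 1 / 6 for k in range(1, 7)}
mean_discrete = sum(k * p for k, p in pmf.items())

# Continuous case: expectation is an integral against the density
# (uniform on [lo, hi], so the density is constant on that interval).
lo, hi = 2.0, 6.0

def density(x):
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(g, a, b, n=10_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

mean_continuous = integrate(lambda x: x * density(x), lo, hi)

print(mean_discrete, mean_continuous)  # about 3.5 and about 4.0
```

The two computations have the same shape; only the sum is replaced by an integral and the PMF by the density.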

So what's there that's left for us to do is to look at these two concepts, joint probability mass functions and conditional mass functions, and figure out what would be the equivalent concepts on the continuous side. So we will need some notion of a joint density when we're dealing with multiple random variables. And we will also need the concept of conditional density, again for the case of continuous random variables.

The intuition and the meaning of these objects is going to be exactly the same as here, only a little subtler because densities are not probabilities. They're rates at which probabilities accumulate. So that adds a little bit of potential confusion here, which, hopefully, we will fully resolve in the next couple of sections.

All right. Thank you.