Home » Courses » Electrical Engineering and Computer Science » Discrete Stochastic Processes » Video Lectures » Lecture 23: Martingales (Plain, Sub, and Super)
Flash and JavaScript are required for this feature.
Download the video from iTunes U or the Internet Archive.
Description: After reviewing Wald's identity, we introduce martingales and show they include many processes already studied. Next, submartingales, supermartingales, and stopped (simple, sub, super) martingales are introduced.
Instructor: Prof. Robert Gallager
Lecture 23: Martingales (Pl...
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: OK, I guess we're all set for getting close to the end, coming now to a race about whether we could say anything meaningful about Martingales or not. But I think we can. I want to spend a little time reviewing the Wald identity today and also sequential tests.
It turns out that last time on the slides-- I didn't get the thresholds confused-- I got hypothesis 0 and hypothesis 1 interchanged from the way we usually do them. And it doesn't make any difference. There's no difference between hypothesis 0 and hypothesis 1. And you can do it either way you want to.
But it gets very confusing when you switch from one to the other when you're halfway through an argument. So I'm going to go through part of that again today. And we will get revised slides on the web, so that if you want to see them with the hypotheses done in a consistent way, you will see it there. That should be on there by this afternoon I hope.
OK, so let's go on and review what Mr. Wald said. He was talking about a random walk. Random walk consists of a bunch of a sequence of IID random variables. The random walk consists of the sequence of partial sums of those random variables.
And the question is if this random walk is taking place and you have two thresholds, one at alpha and one of beta-- and beta is below 0 and alpha is above 0-- and you start at 0, of course, and the question is when do you cross one of these two thresholds and which threshold do cross? What's the probability of crossing each other, and everything else you can say about this problem.
And it turns out this is a very major problem as far as stochastic processes are concerned, because it comes up almost everywhere. It certainly comes up as far as hypothesis testing is concerned. It's probably the major problem there when you get into sequential analysis. It's the major problem there. So it's a very important problem.
And what Wald said was if you let the random variable J be the stopping time of this random walk, namely, the time in which the walk first crosses either alpha or crosses the beta, and then he said no matter what r you choose and the range of points where the moment generating function g sub X of r exists. Y You can pick any r in that range, and then what you get is this strange looking equality here.
And I pointed out last time it just wasn't all that strange, because if instead of using the stopping time of when you cross a threshold, if instead you used as a stopping time just some particular end. You go for some number of steps, and then you stop. And at that point, you have the expected value of E to the rsn. The expected value of E to the rsn by definition is the moment generating function at r of S sub n, which is exactly equal to the minus J times gamma n times gamma of r.
So all we're doing here, all that Wald did-- as it turns out, it was quite a bit-- was to say that when you replace a fixed end with a stopping time, you still get the same result. We're stating it here just for the case of two thresholds. Wald stated it in much general terms. We'll use it a more general terms, when we say more about martingales.
X, now remember is the underlying random variable. S sub n is the sum of the X's. If X bar is less than 0, and if gamma r star equals 0, r star is the r at which gamma of r equals 0. It's the second root of gamma of r.
Gamma of r, if you remember, looks like this. This is r star here. This is the expected value of X as the slope here. And we're assuming that X bar is less than 0 for this. I don't know how that greater than 0 got in.
And then what it says is the probability that SJ is greater than or equal to alpha is less than or equal to the minus alpha r star. And last time, remember, we went through a long, messy bunch of equations for that.
I looked at it again, and this is just the simple old Markov inequality again. All you do to get this is you say, OK, think of the random variable, E to the r SJ. SJ is a very complicated random variable, but it's a random variable, nonetheless. So either the rSJ is a random variable and the expected value of that random variable is at r star, the expected value of it is just one.
I'll write down. It'll be easier. The expected value e to the r star S sub J is equal to 1. And therefore, the probability that E to the r star SJ is a greater than or equal to E to the r star alpha is just less than or equal to 1 over E to the r star alpha, OK? And that's what the inequality says. So that's all there is to it.
OK, what?
AUDIENCE: I don't really see why these two [INAUDIBLE]? They don't [INAUDIBLE].
PROFESSOR: You need x1 negative so that you get another root so that r star exists. If r star is positive, if the expected value of x is positive, then r star is down here at negative r. I mean, you're talking about the other threshold in a sense.
OK, this is valid for all lower thresholds. And it's also valid for no threshold. OK, in other words, this equation here does not have beta in it at all. So this equation is an upper bound on the probability, that you're going to cross that threshold of alpha. And that upper bound is valid, no matter where you put the lower bound at all.
So you can go to the limit as the lower bound goes to infinity. And this inequality should still be valid. You have a homework problem where you actually prove that.
Sometimes when things go to infinity, funny things happen. And that proves that nothing funny happens then. So what happens then is the probability that you ever cross a threshold at plus alpha, when you have a random variable, which has a negative mean, is this exponent here.
And we also sort of showed by looking at the turn off bound that this bound is pretty tight. So in other words, what this is saying is when you're looking at threshold crossing problems-- this quantity here, this quantity where the second root of gamma of r is-- that's sort of the crucial parameter that you want to know. Usually the first thing you want to know about a random variable is its mean, its variance, all sorts of things like that.
This is saying if you're interested in thresholds, forget about all those things, look at r star. If r star is positive that means it means is negative, so there's no problem there. But this one quantity here is sort of the most important parameter of all of these problems.
OK, so let's go back to look at a hypothesis testing again, where we're looking at the likelihood ratio of being the ratio of the density for hypothesis 0 divided by hypothesis 1. What you get then is you observe this sequence Y sub n. These are the observations that you're taking.
In other words, nature at the beginning of this whole experiment chooses either H equals 0 or H equals 1. At that point, you start to make measurements. Now whether nature chooses H equals 0 before or after or when doesn't make any difference. The point is the experiment consists of nature choosing one of these two hypotheses.
You know all the probabilities that exist in the world in this model. You go making these measurements. All you of observe is these measurements. You don't observe what the hypothesis is, so you define this likelihood ratio of the ratio of the densities of the vector Y for H equals 0 and the vector Y with H equals 1.
These quantities exist no matter what the a priori probabilities of the thresholds are or anything else. Even without all of that, so long as you have a model which tells you what the densities of these observations are, conditional on each hypothesis, you can define this. This doesn't depend on a priori probabilities at all.
OK, so now you look at the probability that H is equal to 0, given all these observations divided by the probability it's equal to 1. What you get here now, you have the a priori probabilities, p0 over p1. Here is the likelihood ratio here. So what you have this p0 over p1 times the likelihood ratio of this vector of however many observations you have observed. It's just a nice way of breaking up the problem into the likelihood ratio and the a priori probabilities.
Incidentally, we haven't talked about this at all, but there's an important idea and all of this hypothesis testing of a sufficient statistic, and what do you think a sufficient statistic is. It's anything from which you can calculate the likelihood ratio. In other words, what we're saying here, the point we're making, is that any intelligent choice of hypothesis is it based on a threshold test on the likelihood ratio. And therefore, the only thing you can really be interested in in all your observations is just what is the likelihood ratio?
If they make all these 1,000 observations complicated sort of thing, you calculate one number. And that's the only thing you're interested in. And anything from which that number could be calculated is a sufficient statistic. And anything from which it can't be calculated you've thrown away some of the information that you have.
If you study communication and you study detection, you study how to receive data that's being sent, what you find is that right at the beginning, even before you do any detection, even before you do any filtering, there's some idea of a sufficient statistic there. That's what you need in order to calculate everything else. And you want to make sure that you have that. So that's an important idea there.
OK, but anyway, the MAP rule, which comes right from this, says if you have these a priori probabilities, and you're trying to maximize the probability of choosing correctly, what do you do? Well, your probability of H equals 0 was the correct hypothesis, given all the observations you made, is in fact this. The probability that H equals 1 is the correct hypothesis is this.
What do you do if you want to maximize the probability of being correct? You choose the one which is biggest. In other words, what you do is you look at this number. And if this number is bigger than 1, you choose 0. If it's less than 1, you choose hypothesis 1.
And what it turns out to is threshold of rule. You take this likelihood ratio. You compare it with p1 over p0. And in this case, you select h equals 0. In this case you select H equals 1.
And the last time I just a 1 and 0 reversed, which is fine, but if you reverse them one place, you want a reverse them every place. And every other threshold test does something like this, except you replace p1 over p0 with some arbitrary threshold. You say whatever reason you want to find for that threshold, that's the only intelligent kind of test you can make.
OK, then we define the log-likelihood ratio of the logarithm of the likelihood ratio. And that was nice because it was a sum of this quantity related to the individual observations. For each observation you really want to know what f of Y given H of Y given 0, divided by Y given 1. You want to divide those two. You want to take the logarithm of it. And then you have those numbers, and the sufficient statistic that you're interested in is just a sum of those numbers.
So you're looking at a sum of IID, random variable. IID, why IID? Well, under the hypothesis that H is equal to 1, those Yi's are IID. And therefore under the hypothesis that H is equal to 1. Well, little Zi is just a sample value. If you look at the random variable which has these sampled values, Z sub i, under the probability measure, corresponding to H equals 1, those Z sub i's are IID.
So what that says is when you look at these sums of random variables, the sum of Zi from 1 to n, under hypothesis H equals 1, what do you get? You get a random walk. You get a sum of IID random variables.
If you take more observations, S sub n just changes. With n changing, then you have a larger number of observations. So the random walk goes a little further out, and you might get closer to a threshold or whatever. And that's what we're trying to do here.
OK, so the Z sub i's under the hypothesis H equals 1, or IID, and the moment generating function of the Z sub i's given H equals 1, is this. Let's be careful about this. The sampled values of the Z sub i do not depend on the hypotheses at all.
Namely, you make an observation. You make an observation of Y sub i. You calculate Z sub i from Y sub i. That has nothing to do with whether H equals 0 or H equals 1. You try to calculate this moment generating function, however. And you want to know what the probability density of the Y's are. And you get a different probability density for H equals 1, then you get on the other hypothesis.
If the observations behaved the same way under both hypotheses, it wouldn't make much sense to do the observation. Unless you have a government grant, and you're trying to get money out of the government instead of trying to do anything worthwhile. Under those circumstances, you keep on making observations. You now perfectly well that nothing is going to come from them. But otherwise, it's a little silly.
So this moment generating function under the hypothesis H equals 1 is given by this quantity here. And this density here is the same as this density here. So you get this density to the 1 minus r power, and you get this density to the r power. So you get the product of these two densities. You integrate it over Y, and that's what gamma 1 of r is.
Now I said that the really important thing in all of these threshold problems is what is our star? And for this problem, r star is trivial. It's always the same. r star is always equal to 1.
And the reason is when you set r equal to 1 here, this quantity becomes 1. This quantity becomes the density of Y conditional on H equals 0. When you integrate that, you get 1.
So for all of these hypothesis testing problems, r star is equal to 1. Gamma 1 of 1 is equal to 0. And this is what this curve says. OK, this is gamma 1 of r here. This curve starts out here, negative slope. It comes up here. r star is equal to 1 in this case. And that's sort of the end of the story for that.
Now if you are doing a test with a fixed value of n, you say I'm going to make n observations, it's all I have time for. The week is over. I'm going on vacation next week. I've got to stop this test. I've got to write my paper. Take the end test. You write your paper.
And what do you do? You go through the optimal tests the best you can. And what you find is given H equals 1, an error is going to occur if the sum of random variables, namely the log-likelihood likelihood ratio, exceeds the logarithm of your threshold.
OK, this is whatever threshold you decide to establish. And we showed before that the probability that S sub n is greater than or equal to log of the threshold is evaluated as E to the n times this quantity right here. The probability of error given H equals 1 is this quantity here.
Probability of error given H equals 1 is the probability that the data looks like H equals 0 was a right hypothesis. In other words, that you crossed the threshold at plus alpha, instead of crossing the threshold at beta. Excuse me. We have too many cases here we're looking at, so it gets confusing. What I'm looking at here is the probability that the log-likelihood ratio exceeds this threshold data, whatever we set beta to be.
Eta is set, depending on the cost of making errors of both types on our a priori beta, if we have any and all of those things. And the probability of error given H equals 1 is this quantity here, which has the threshold in it over there. We've looked at that a number of times in a lecture. We looked at it in chapter one. And then those, we looked at it in chapter seven.
And you calculate it by taking this moment generating function, drawing attention to it at the point where slope natural log of eta divided by n. And then you take where it comes in to this vertical axis here. And that's the exponent of the error of probability when hypothesis 1 is correct.
Now, if the hypothesis is H equals 0 instead, at that point with H equals 0, the expected value of this log-likelihood ratio is going to be positive. The situation is going to be a curve that comes over here, comes back at some point here. And what we've showed is that this curve is just a translation of this curve by 1.
OK, namely if you calculate the moment generating function for H equals 0, you get the same thing that we got before. I'm not going to go through all the details of this. Now you have 0 here instead of 1.
Over here you're going to have just minus r. And over here you're going to have 1 plus r. So this whole thing is translated by 1. The action happens here. But if you translate it by 1 over in this direction, what happens is the error of probability is determined by this.
It's this. The exponent is this point right here, gamma 1 of r0 plus 1 minus r0, log of eta over n. And the r0 was again determined by this point at which the slope is equal to log eta over n.
We did that before. Then I want to make clear you understood it, because to really understand it, you have to go through the arithmetic yourselves at least once. And you can do that easily by following the notes, because it does it in almost excruciating detail.
So that's the argument you get. We had this idea before, of the Neyman-Pearson principle, which says you don't assume a priori probabilities. You look at the probability of making an error as being a trade off between the error you make when H is equal to 1 and the error you make when H is equal to 0.
In terms of the Chernoff bound, this trade off is very clear. As you change the exponent that you want to get under H equals 1, this point moves. The tangent then moves. And the exponent over here moves.
So you have this inverted seesaw. And the exponent for one kind of error is over here. And the exponent for the other kind of error is over there. Then the next thing we said was this is really stupid, unless you're going on vacation this Friday. If you're not going on vacation this Friday, if you're really serious about making the right decision, then what you're going to do is keep on making observations until you're pretty sure you're right.
Now somebody at the end of the lecture last time pointed out something, which says that when you do experiments and you keep on making observations until you get the data that you want, there's something very unethical about that. Is this that kind of unethical behavior? Or is this really valid?
Well, I claim this is valid, because what we're doing when we're doing sequential testing is we're deciding what we're going to do ahead of time. Namely, we've decided what we're going to do is we're going to continue testing until we cross a threshold and threshold gives us a suitable probability of error. So we're not cooking the books at all. What we're doing is we're following this preset procedure we've set up. And the only question is can we get a very small error probability by using a smaller number of observations on the average than what we need otherwise?
Put it in terms of a communication system. One kind of communication system, you have to send some data from one point to another. You're not going to get any feedback on it. You've got to get the data through the first time. It's got to be right. What are you going to do?
You're going to send this data a very large number of times or use a very powerful coding technique on it. And by time it gets through, you're going to be very sure you're right. Now a much better procedure, and the thing which is used in almost all communication systems, and the thing which we use as human beings all the time, and the thing which control people use all the time, the thing which almost everybody uses, because most of us have common sense if we spend some time trying to do these things, is instead of trying to get it right the first time, we try little bit to get it right the first time. And we make sure that if we don't get it right the first time, we have some way of finding out about it and getting it right the second time.
And in the scientific way of looking at it, what we do is we decide ahead of time exactly what our procedure is going to be for making repetitions-- something called ARQ in communication systems, which means automatic repeat request. It's automatic, which means you don't try to make your decision depending on whether you'd like to receive this 0 or like to receive a 1. You make the decision ahead of time that if you have a clean enough answer, you're going to accept it. If it looks doubtful, you're going to send it over again. That's exactly the same sort of thing we're doing here.
OK, when we do that, given H equals 1, we again have this S sub n as a function of n as a random walk. It's a sum of IID random variables and conditional on H equals 1. You have a random walk.
Conditional on H equals 1, you have a negative slope on this random walk. The random walk starts out and on the average is going to go down, and it's going to continue going down forever. And if you're looking for across some positive threshold, if it doesn't cross it pretty soon, it's not going to cross it.
But anyway, we have a test which said we have some positive threshold. We have some negative threshold. If we ever cross the positive threshold, we say H is equal to 0. If we ever cross the negative threshold, we say H is equal to 1. And then we're done with it.
OK, now, let me give you another argument why that makes sense. I gave you one argument last time. I'll give you another argument this time. If S sub J is greater than or equal to 0, we're going to decide that 0 is the correct hypothesis.
If H equals 1 is the correct hypothesis, then we're going make an error when S sub J is greater than or equal to alpha. If S sub J is less than or equal to the beta, we're going to decide H equals 1. And conditional on H equals 1, an error is made if SJ is greater than or equal to alpha. Conditional on H equals 0, an error is made if SJ is less than or equal to beta.
OK, so the probability of the error conditional on H equals 1 is the probability that S sub J is greater than or equal to alpha, given H equals 1, which is less than or equal to E to the minus alpha or star. This is the thing that we said before. r star is the root of gamma of r. And gamma of r is equal to this.
OK, so, let's just make life a little easier for ourselves assume that our a priori probabilities are each 1/2. This is also called maximum likelihood decision. You take this likelihood ratio, and you just decide on the basis of the likelihood ratio.
OK, then at the end of trial end, the probability of H equals 0 given Sn divided by the probability of H equals 1 given Sn, the a prioris cancel out. It is just E to the S sub n. That's what it is.
It's the likelihood ratio. S sub n is the log-likelihood ratio. So this is what it is. If you now take probability of H equals 0 on probability that H equals 1 given S of in, this equation becomes this equation. And then the probability of H equals 1 given S sub n is just E to the minus Sn over 1 plus E to the minus Sn.
Now if Sn is a large number, E to the minus Sn is going to be totally trivial. And the probability that H equals 1 given Sn is essentially E to the minus Sn. It means when you can choose different values of n, this very directly gives you a control on what the probability of error is.
The probability of error is essentially E to the minus Sn. So if you choose a threshold alpha, what you're doing is you're guaranteeing that the probability of error cannot be less than E to the minus alpha. OK, so this is more than just talking about averages. This is saying if you use a threshold rule, then what you're doing is guaranteeing that the probability of error is never going to be less than this quantity of specified here.
OK, we saw last time the cost of choosing alpha to be large is that you have to make a very large number of trials, at least given H equals 0. Why don't I worry about the number of trials for H equals 1? I mean, it's nothing to be thought through here. If my thresholds are large, my probability of error is very small. The expected values of things for very large log-likelihood ratios are determined almost entirely by H equals 0.
H equals 1 sometimes. You sometimes make a mistake, because it's something very, very unusual. But that has very little influence on the expected number of tests you're making. So what happens then is the expected number of tests you make under the hypothesis that H is equal to 0-- now we're using Wald's equality rather than Wald's identity-- it's equal to the expected value of S sub J given H equals 0, divided by the expected value of Z, given H equals 0. Z is the log-likelihood ratio of one trial.
This is just Wald's equality with this condition thrown into it. Now what's the expected value of SJ given H equals 0? It's essentially alpha, and if you want to be more careful, it's alpha plus the expected overshoot given H equals 0. And that's divided by the expected value of Z, given H equals 0.
This is the answer we got last time. So the number of tests you have to make, if you set a positive threshold alpha, is essentially the number of tests you have to make when the hypothesis is equal 0. So the funny thing which is happening here is that as you change alpha, you're changing the probability of error for hypothesis H equals 1. And you're changing the number of tests you're going to have to do when H is equal to 0.
When you change beta, it's just the opposite. So that when you change beta, if you make beta a very large negative, you have to make an enormous number of tests under the circumstance that H is equal to 1. But you might make an error when H is equal to 0. So the trade off is between number of trials under one hypothesis, error of probability under the other hypothesis.
That's almost all we wanted to say about Wald's identity. There's one other huge thing that we want to talk about. If you take the first two derivatives of Wald's identity at r equals 0, you get some interesting things coming out. I mean, Wald's identity, you can use it any value of r you want to. And when you use it for a large value of r, you that an interesting result about large deviations. When you use it at a small value of r, you get something more about typical cases.
So looking at it at r equals 0, what you want to do is you want to take the derivative with respect to r of Wald's identity. This expected value in here we know is equal to 1. It's equal to 1 whatever value of r we choose. And therefore, when we take the derivative of this, we have to get 0. But we also want to take the derivative of it to see what we get.
So when you take the derivative of this quantity here and you don't worry about what exists and what doesn't exist-- you have to take the derivative here-- so you get an S sub J there. You take the derivative here, you get a gamma prime of r there. If you get SJ minus J times gamma prime of r, and this E to the what have you just sits there. You take the derivative of E to something, you never get rid of the E to something. You just get piled up stuff in front of it.
OK, so when we evaluate that at r equals 0, what happens? Well, what's the value of the gamma prime of 0? It's the expected value of X, yes. And this quantity here is all equal to 1, so we can forget about that. When r is equal to 0, this is equal to 0. When r is equal to 0, gamma of r is equal to 0.
So this whole thing in here is 0. So E to the 0 is 1. So we've got a 1 there. We got expected value of S sub J minus J times X bar is equal to 0.
What is that? That's Wald's equality. So Wald's equality falls out of the Wald's identity as what happens as the derivative of Wald's identity that r equals 0. Well, since we're so successful with that, let's go on and take another derivative. Yes?
AUDIENCE: I guess you want the final equal to 0 [INAUDIBLE].
PROFESSOR: Oh, the final equal to 0 comes from the fact that this quantity here that you're starting with is equal to 1 for all values of r. Therefore, I want to take the derivative with respect to r, I get 0. So that's one equation. The other thing is I just go through the mechanics of taking the derivative.
OK, so let's try to take the second derivative. Take the second derivative by taking the derivative of the first derivative. And what happens is then is this quantity in here I get an extra term of that sitting over there. And along with that, I get the derivative of this with respect to r.
I should probably have written that down there but since I didn't, let me see if I can do it. I get the expected value of SJ minus J gamma prime of r. And this quantity is squared now, because I have this there. I'm taking the derivative of this term with respect to r.
And also, I have to take the derivative of this with respect to r. So that gives me minus J times gamma double prime of r. And all of this times E to the r SJ minus J gamma of r.
Now I want to evaluate this at r equals 0. Evaluating this at r equals 0, this term goes away. So I wind up with the expected value of SJ minus J gamma prime of r where minus J gamma double prime of r is equal to 0.
Well, this doesn't look bad. But if you try to use it if you expand this term here, you get a term the expected value of S of J times J. And you can struggle with that. And it's ugly. That's very ugly.
But if you now say, if we have a mean we can use Wald's equality. It tells us what we want to know. If we don't have a mean, then Wald's equality doesn't tell us anything. But this is going to tell us something.
So we're going to make the assumption here that r is equal to 0 and X bar is equal to 0. And if X bar is equal to 0, gamma prime of 0 is equal to 0. And gamma double prime of 0 is equal to sigma squared of X.
So you do all of that. What you get is the expected value of S sub J squared minus sigma X squared of J is equal to 0. This is the same kind of thing that we got from Wald's equality.
From Wald's equality, it didn't tell us anything. It just gave us a relationship between the expected value of S sub J and expected value of J. This is doing the same thing. It's giving us a relationship between the expected value of S sub J squared and the expected value of J. So we get the same kind of quantity. It's doing the same thing for us.
Now you look at this for a 0 means simple random walk. Now you would have thought before you started to take this class that a simple random walk with mean 0 was the simplest thing in the world. And we've seen by looking at that it really isn't all that simple, that you come play these silly games like you can gamble forever, where with probability 1/2 you lose $1, with probability 1/2 you win $1-- perfectly fair game.
And with probability one, you to make $1 out of that, and quit and go home. And since you to make $1 out of it, and quit and go home, you can then quickly come back again and it again.
You can make $2. You can make $10. You can make $1,000. You can make $1 million with probability 1. So the simple random walk is no longer simple. It becomes puzzling.
But Wald's identity is dealing with two thresholds, one of alpha and one at beta. When we apply this and you observe it as a simple random walk, where you either go up by 1 or go down by 1, each with probability 1/2, the mean of X is 0 and the variance of X is 1. So this quantity here is 1.
You can then play games with what the probability is that you hit the upper threshold and the probability that you hit the lower threshold. I mean, it's done in the text. You don't have to take my word for it.
And when you do that, what you find is the expected value of J is equal to minus beta times alpha. Theta is a negative number, remember, so this is expected value of J is the magnitude of theta times the magnitude of alpha.
Now that's a little bizarre, but then you think about it a little bit. You think what happens. And this is really exact. I mean, this isn't an approximation or anything.
If alpha is very large, and beta is very large and negative, and you play this random walk game, you're going to fluctuate a long time. You're going to disperse slowly. You're going to disperse according to the square root of n, or the number of tests you take. So the amount of time it takes you until you get way out to these thresholds should be-- to the namely the value that n has to have-- roughly the square of alpha when beta and alpha are both the same.
This is something more general than that. It says that if Sn, the stop-when-you're-ahead game, we make alpha equals 1, the expected value of J depends on what the lower threshold is. And that suddenly makes sense, because what that's saying is if we have a lower threshold at 10, an upper threshold at one, then most of the time you win.
But when you lose, you lose $10. When you win, you win, $1. When you set a lower threshold at 100, when you lose, you lose $100. When you win, you win $1.
And suddenly, that stop-when-you're-ahead game does not look quite as attractive as it did before. What you're doing is taking a chance where you're probably going to win of winning $1, and you're risking your life's assets for it, which doesn't make too much sense anymore. OK this, I think, it gives you a better idea of what's going on on the simple random walk than anything else I've seen.
So it's time to start talking about martingales. A martingale, like most of the other things we've been talking about in the course, is a sequence of random variables. This is a more general kind of sequence than most of them.
Almost all of the processes we've talked about so far have been the kinds of things you can sort of get your hands on. And this is defined very abstractly in terms of a peculiar property that it has. And then the peculiar property it has is the expected value as the nth term, in this thing called a martingale, conditional on knowing the values of all the previous values, expected value of Z sub n given the value of Z and minus 1, Z and minus 2, all the way down to Z1 is equal to Z sub n minus 1. Namely, the expected value here is what you had there.
The word martingale comes from gambling, where gamblers used to spend a great deal of time trying to find gambling strategies when to stop, when to start betting bigger, when to start betting smaller, when to do all sorts of things, all sorts of strategies for how to lose less money. Let me put it that way, because you rarely find that opportunity where you can play a fair game. But if you play a fair game, martingales are what sort of rules on that.
And what that says is if you play this game for a long time, your capital is Z sub n minus 1. This says, figure expected capital after you play one more time. No matter what strategy you use, your expected capital is going to be the same as was as the actual capital the time before.
If this is too abstract for you, and it's too abstract for me half the time, because I look at this, and I say, gee, that's not much of a restriction, is it? What we're talking about is expected values here. But it's more than that, because it's saying for every choice of sample value for all of these things, none of them make any difference, except the last one.
And that's what happens in gambling. It doesn't make any difference how your capital has gotten to the point where it is at time n minus 1. You make a bet in a fair bet, and what you win is solely a function of what you've bet, if the game is fair. And that's what this is saying.
So when you write it out this way, the expected value of Zn, given that 1, the random variable Zn minus 1 has a particular value Zn minus 1. You started out with a particular value. It says that expected value is equal to what you said at the last time. And this is true for all sample values Zn minus 1 down to Z1, which is why it's a much stronger statement than it appears to be. now there's a lemma. I want to talk about that a little bit, because it's a good time to get you used to what these expected values mean. For martingale, the expected value of Zn given Zi, Zi minus 1, all the way down to Z1. This expected value is equal to Z sub i.
In other words, it's not only that your expected capital, given all of the past, is equal to what you had on the last time instant. If you're not given anything for 100 years back, and all you know is what your capital was 100 years ago, and if we think we're playing a fair game all of this time, which is of course always a question, the expected value of what we have now, conditional on everything from 100 years back through recorded history, is just that last term. In other words, it's the same kind of isolation of the past from the future as we had with Markov change.
With Markov change, remember, it's only what happens at one instant, given what happens a one instant, it makes the past independent of the future. Here it's not quite that way, because the past and the future are separated only in terms of the details of the past, and the expected value the future. It says expected value of Z sub n given all of the details of the past, no matter what the details of the past are, the effective value of Zn is equal to the actual value at times Z sub i.
So I want to improve this for you, and I warn you you're not going to follow this proof. And that's part of the reason for me to do it, because I want you to go back to Chapter 1 and think that through. Because in dealing with martingales, you have to think this through.
Because if you don't think it through, you're stuck with this notation all the way through. And if you try to use this notation with martingales, this is nice notation when you get confused, but you don't want to use it all the time. So you have to be able to go through arguments like this.
What I want to show is that if E to the Z3, given Z1 and Z2 is equal to Z2, then one special case of this lemma is that expected value of Z3 given Z1 is equal to Z1. And how do we show that?
Well, what we do is we use this law complete expectation. Well, first we remember the expected value of an arbitrary random variable X is the expected value of the expected value of X given Y. Now what does that mean?
The expected value of the random variable X, given Y, is a random variable. It's a random variable which depends on Y. That's a function of the sample value of Y. Namely, if you look at this quantity up here, expected value of X given Y equals 1. Expected value of X given Y equals 2. Expected value of X given Y equals 3.
We have all of these values here. We have a probability measure on it. This is a random variable, which is a function of Y. You've averaged that over X, but you're left move why because of the conditioning here.
So this quantity in here is now a function of Y. So when we take this equation and we add the conditioning on Z1, namely, this is being used for Z3 and Z2. Expected value of Z3 is equal to the expected value over Z2 of the expected value of Z3 given Z2, whole thing dependent on Z1.
OK, so what it says is this expected value is the expected value of the expected value of Z3 condition on Z2 and Z1. This quantity here as a function of what? That's a random variable. It's a function of what random variables?
AUDIENCE: Z2, Z1.
PROFESSOR: Z1 and Z2, yes. So this is a function of Z1 and Z2. What value is it as a function of Z1 and Z2? It's just equal to Z2.
So this quantity in here is Z2. so we're asking what's expected value of Z2 given Z1. And by definition of martingale, it is equal to Z1.
Now I imagine about half you could follow that, and half of you couldn't, and half of you sort of followed it. This is a kind of argument we'll be using all the way through on this stuff. So make sure you understand it. I mean, once you get it, it's easy. And you can apply it in all sorts of places. So it's worth doing it.
In the same way, you can follow the same kind of argument through the expected value Z sub i plus 2, using this total expectation based on Zi plus 1. And you go through the whole thing. When you go down to i equals 1, it says the expected value of z is equal to the expected value of Z1.
If you want to become wealthy, have a wealthy parent, who leads to a lot of money 20 years ago. That's the easiest way to make a million dollars is to start out with 2 million dollars is the way that some people put it.
OK, let's have some simple examples a martingales. One of them is a zero-mean random walk. Mainly, what I'm trying to do here is to show you this martingales are really pretty general things. And since there are many very general theorems that hold for all martingales, you can then apply them to all of these special cases, which is kind of neat.
To have a zero-mean random walk, let Z sub n be the sum of X1 plus Xn and the X sub i's are IID and zero mean. The fact that they're IID makes it a random walk. The fact that there's zero mean makes it a special zero mean random walk, and the expected value of Z sub n, given Zn minus 1.
All the way back, Zn now is Xn plus Zn minus 1. OK, Zn is the sum of all these random variables. So you get up n minus 1 of them, and then you add the last one in. So it's Zn minus 1 plus Xn, so its expected value of Xn plus Zn minus 1, given all the stuff before that.
The expected value of Xn, given all this stuff, is what? Xn is independent of all the other X's, therefore it's independent of all the earlier Z's. And therefore, that's just expected value of Xn. So we have the expected value of Zn minus 1, given Zn minus 1 back to Z1.
What's expected value of Zn minus 1, given Zn minus 1? Well, it's Zn minus 1. That's no problem there. So this is 0. So this is equal to Zn minus 1, as it's supposed to be.
All of these things you ought to go back and think them through yourself. Because the first time you look at martingales, all of this stuff, it's all pretty easy, but it all looks a little strange at first. The next one is sums of arbitrary dependent random variables. They're not quite arbitrary.
Suppose you have a sequence of random variables, X sub i, i greater or equal to 1. And they satisfy the expected value of Xi, given all the earlier X of i's is equal to 0. It's similar to what a martingale is. But here, we're just saying the Xi's all have expected value of 0.
And Zn, the sum of these, has to be a martingale. And I'm not going to write all a proof of that. I mean, this proof is really the same as this proof.
This is really a pretty important thing, because given any martingale, you can always look at the partial sums between the terms of the martingale-- namely, given Z1, Z2, up to Z sub n. You can always look at Z2 minus Z1. You can look at Z3 minus Z2. You could at Z4 minus Z3, and so forth.
And each of the Z's is just the sum of those other random variables. So given any martingale in the world, you can always define the set of arbitrary depending random variables would satisfy this rule here. So Zn, in this case, is a martingale. And if Zn is a martingale, you can always define a set of random variables, which satisfy this property.
I think it's almost easier to see what a random variable really has to do with gambling, which is where it started, by looking at this. This is not your capital at time n. This is how much you win or lose at time i. And what it's saying is your winnings or losings at time i has zero mean independent of everything in the past.
In other words, in a fair game, you can bet whatever you want to and depending on what you bet, that's the expected amount you get on that trial. And that's what this says. This says essentially, you're applying a fair game. So martingales really have to do is fair games.
If you can find fair games, why, that's great. But we always look for games where we have an edge. But what you want to avoid is games where Las Vegas has an edge.
OK so, that's a general one. Here's an interesting one, because I think this is an example which you can use. I mean, in any field you study, there are always generic examples, which can be used to generate counter examples to any simple thing you might want to think of. And this is to me the most interesting one of those for martingales.
Suppose that Xi is the product of two random variables-- one is either plus 1 or minus 1, each with probability 1/2. And the other one, Y sub i is anything it wants to be. I don't care what Y sub i is. Y sub i is non-negative, might as well make it non-negative. I don't care about it. I don't care how it's related to all the other Y sub i's. All I want is the that the U sub i's are all independent of all the Y sub i's.
And what happens then? I take the expected value of X sub i, give it anything in the past, and what do I get? U sub i is independent of Y sub i. And therefore, the expected value of U sub i times Y sub i is expected value of U sub i, which is what-- plus 1 or minus 1 of probability 1/2 each.
The expected value of U sub i is equal to 0. That makes expected value of X sub i of 0 whatever the past is. So you automatically have the-- I don't know what to call in them, the terms between the terms of a martingale-- the interarrival terms, so to speak.
I mean, it's like those for a renewal process. Those terms always have mean 0. And therefore, the sums of these turn out to be this simple kind of martingale. So that's a nice martingale to use as counter examples for almost anything.
The next one is product for martingales. Product for martingales are things we use quite a bit too, because now when we're using generating functions, we're in the habit of multiplying things together. And that's a useful thing to do.
So the expected value of Z to the n, given Z to the n minus 1, now to Z1, where Zn is this product of terms. OK, Z sub n then is equal to Xn times Z sub n minus 1, which is what we're doing here. Expected value of Zn conditional on the past is the expected value of Xn times Zn minus 1, conditional on the past, Xn and Zn minus 1.
Oh, the expected value of Xn for any given value of Zn minus 1, all the way back, is just the expected value of X sub n. So we have expected value of X sub n times the expected value of Z sub n minus 1, given Zn n minus 1 down to Z1. So that's just Zn minus 1.
Ah, the missing quantity, fortunately I wrote it here. The X sub i's are unit means random variables. And they're IIDs. They're independent of each other. And since the X sub i's are independent of each other, X sub n is independent of Xn minus 1, all the way back to X1.
Zn minus 1 back to Z1 is a function of Xn minus 1, down to X1. So Xn is independent of all those previous Z's also. That's why I could split this apart in this way. And suddenly I wind up with Zn n minus 1 again. So product form martingales work.
Special form of product form martingales-- this again is favored counter example for when you can and can't get around with going to limits and interchanging limits. And it's a simple one.
Suppose that X sub i's are IID, as in the previous example. And they're [INAUDIBLE] probably 2 or 0. I mean, this is a game you often play-- double or nothing.
You start out with dollar. You play the game. If you win, you get $2. If you lose, you're broke.
If you win, you play your $2. If you win again, you have $4. If you lose, you're broke.
You play again. If you win, you have $8. If you lose again, you're broke.
So the probability that Z sub n, which is your capital after n trials, is equals to 2 to the n, namely you've won all n times, is 2 to the minus n. And every other instance you've lost. So the probability that Zn is equal to 0 is 1 minus 2 to the minus n.
So for each n, if you calculate the expected value of Z sub n, it's equal to 1. Namely, with probability 2 to the minus n, your capital is 2 to the n, so that's 1. With all the other probability, you have nothing.
So your expected value of Z sub n is always equal to 1. That's what this product form martingale says. And this is a product form martingale.
However, the limit as n goes to infinity of Zn is equal to 0 with probability 1. If you play double or nothing, eventually you lose. And then you're wiped out.
In other words, there's no real purpose to playing the game, because eventually you lose. If you're playing with somebody else, and they're playing double or nothing, then of course you get their money eventually. Or you go broke and the bank that you bank at fails, and all that stuff. We won't worry about that.
OK, but the point of this is that the limit as n goes to infinity of Zn is equal to 0. And the limit of the expected value of Z sub n is equal to 1. And therefore, the limit of Zn and the expected value of the limit of Zn.
The expected value of the limit of Zn is 0. The limit as expected value of Zn is equal to one. So this is a case where you can't interchange limit and expectation. It's an easy one to keep in mind, because we all know about playing double or nothing.
Might as well define submartingales and supermartingales. Because the first thing to know about them is they're like martingales, except they're defined with inequalities. And for a submartingale, the expected value of Zn, given all the previous terms, is greater than or equal to Zn minus 1.
So submartingales go up. Supermartingales are the opposite. Supermartingales goes down.
What else could you expect from a mathematical theory? Things that should go up, go down. Things that should go down, go up. Only thing you have to remember about submartingales and supermartingales is you figure out what terminology should have been used, and you remember the terminology they use was the opposite of what they should use.
I don't know whether I've ever seen stupider terminology than this. And someone once explained the reasoning for it, and the reasoning was stupid too. So there's no excuse for that one.
We're only going to refer to submartingales in what we're doing, partly because that's where most of the neat results are. And the other thing is if you have to deal with a supermartingale, what you might as well do is instead of dealing with a sequence-- Z1, Z2-- deal with the sequence minus Z1, minus Z2, and so forth. And if you change the sign on all the terms, then you change supermartingales into submartingales and vice versa. You don't really have to deal with both of them.
Let me talk briefly about an inequality that I'm sure most of you heard of. How many people have heard of Jensen's Inequality? Maybe half of you, so not everyone. Well, it's one of the main work horses of probability theory. Even though we haven't seen it yet this term, you will see it many times.
So what a convex function is. A convex function in simple minded terms is something, a convex function from r into r. A real value convex function is a function which has a positive second derivative everywhere. So it curves down and comes back up again. Since you also want to talk about functions which don't have second derivatives, you want something more general than that. So you go from derivatives.
You go back to your high school ideas, and you draw a picture. And function is convex. If all the tangents to the curve lie not strictly below, but all the tangents can curve lie beneath the curve, so wherever you draw a tangent, you get something which doesn't cross the curve.
Magnitude of X is a convex function of X. Magnitude X as a function of X looks like this. You go down or up. And all tangents to this, this goes off to infinity and this goes off to infinity. So there's no way to get something like that in this tangent.
So you have one tangent here. You have a bunch of tangents along here. And you have one tangent there. And they all lie below the curve. So X bar is a convex function too.
And Jensen's Inequality says if H is convex, and if Z is a random variable, it has finite expectation, then H of the expected value of Z is less than or equal to the expected value of H of Z. You can interchange expected value and function with inequality like this, if the function is convex.
Now, why is this true? You can see why it's true automatically, if you're dealing with a random variable that has only two values. If you have two values for the random variable-- Z is a random variable here.
You have one variable here, one value here, one sample value here. Look at what the expected value of H of Z is. The expected value of H of Z is the expected value of this and this with the appropriate probability put in on it, so at some point that lies on the straight line between here and there.
When you look at the H of the expect the value, then you find the expected value along here. You can think of finding it along the straight line here. And then it's that point there. So since the curve is convex, the H of the expected value of Z is there's always a bunch of points. Average them, which lie on a straight line beneath the curve.
And the expected value of H of Z is taking the average directly along the curve. So you get this boosting up everywhere. It's like saying that the absolute value of expected value of Z is less than or equal to the expected value of the absolute value of Z.
And I think I will stop there instead of going on, because we had a lot of new things today. And somehow sequential detection always wears one's mind out in a short period of time. That should be enough.
Martingales (Plain, Sub, and Super) (PDF)
MIT OpenCourseWare makes the materials used in the teaching of almost all of MIT's subjects available on the Web, free of charge. With more than 2,200 courses available, OCW is delivering on the promise of open sharing of knowledge. Learn more »
© 2001–2015
Massachusetts Institute of Technology
Your use of the MIT OpenCourseWare site and materials is subject to our Creative Commons License and other terms of use.