PROFESSOR: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
So welcome back. So we are now moving to a new chapter, which is going to have a little more of a statistical flavor when it comes to designing methods, all right? Because if you think about it, OK-- some of you have probably attempted problem number two in the problem set. And you realize that maximum likelihood does not always give you trivial estimators, right? I mean, when you have an N(theta, theta) model, the thing you get is not something you could have guessed before you actually attempted to solve that problem. And so, in a way, we've seen already sophisticated methods.
However, in many instances, the maximum likelihood estimator was just an average. And in a way, even if we had this confirmation for maximum likelihood that indeed that was the estimator that maximum likelihood would spit out, and that our intuition was therefore pretty good, most of the statistical analysis or use of the central limit theorems, all these things actually did not come in the building of estimator, in the design of the estimator, but really in the analysis of the estimator. And you could say, well, if I know already that the best estimator is the average, I'm just going to use the average. I don't have to, basically, quantify how good it is. I just know it's the best I can do.
We're going to talk about tests. And we're going to talk about parametric hypothesis testing. So you should view this as-- parametric means, well, it's about a parameter, like we did before. And hypothesis testing is on the same level as estimation. And on the same level as estimator will be the word "test," OK? And when we're going to devise a test, we're going to actually need to understand random fluctuations that arise from the central limit theorem better, OK? It's not just going to be in the analysis. It's also going to be in the design. And everything we've been doing before in understanding the behavior of an estimator is actually going to come in and be extremely useful in the actual design of tests, OK?
So as an example, I want to talk to you about some real data. I will not study this data, but this data actually exists. You can find it in R. And so, it's the data from the so-called Credit Union Cherry Blossom Run, which is a 10-mile race. It takes place every year in D.C. It seems that some of the years are pretty nice.
In 2009, there were about 15,000 participants. Pretty big race. And the average running time was 103.5 minutes, all right? So about an hour and three-quarters.
And so, you can ask the following question, right? This is actual data, right? 103.5 is actually the average running time over all 15,000 runners. Now, in practice, collecting all of this may not be very feasible. And you might want to just sample a few runners and try to understand how they're behaving every year without having to collect the entire data set.
And so, you could ask the question, well, let's say my budget is to ask for maybe 10 runners what their running time was. I still want to be able to determine whether they were running faster in 2012 than in 2009. Why do I put 2012, and not 2016? Well, because the data set for 2012 is also available. So if you are interested and you know how to use R, just go and have fun with it.
So to answer this question, what we do is we select n runners, right? So n is a moderate number that's more manageable than 15,000. From the 2012 race at random. That's where the random variable is going to come from, right? That's where we actually inject randomness in our problem.
So remember, this is an experiment. So really, in a way, the runners are the omegas. And I'm interested in measurements on those guys. So this is how I have a random variable. And this random variable here is measuring their running time. OK. If you look at the data set, there are all sorts of random variables you could measure about those random runners: country of origin, height, age, a bunch of things. OK. Here, the random variable of interest is the running time. OK. Everybody understand what the process is?
OK. So now I'm going to have to make some modeling assumptions. And here, I'm actually pretty lucky. I actually have all the data from a past year. I mean, this is not the data from 2012, which I also have, but I don't use. But I can actually use past data to try to understand what distribution I have, right? I mean, after all, running time is going to be rounded to something. Maybe I can think of it as a discrete random variable. Maybe I can think of it as an exponential random variable. Those are positive numbers. I mean, there are many kinds of distributions I could think of for this modeling part.
But it turns out that if you actually plot the histogram of those running times for all 15,000 runners in 2009, you are pretty happy to see that it really looks like a bell-shaped curve, which suggests that this should be a Gaussian. So what you go on to do is you estimate the mean from past observations, which was actually 103.5, as we said. You estimate the variance, which was 373. And you just try to superimpose on this histogram a Gaussian PDF with mean 103.5 and variance 373. And you see that they actually look very much alike. And so here, you're pretty comfortable saying that the running time actually has a Gaussian distribution. All right?
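This fitting step can be sketched in a few lines. The following is a minimal illustration with simulated stand-in data (the real 2009 times live in an R data set we don't assume access to here); the values, the seed, and the sample size are all made up for the example.

```python
import math
import random

# Hypothetical stand-in for the 2009 running times (in minutes); the real
# data is an R data set, so we simulate 15,000 plausible values instead.
random.seed(0)
times = [random.gauss(103.5, math.sqrt(373)) for _ in range(15000)]

# Estimate the mean and the variance from the past observations.
n = len(times)
mean = sum(times) / n
var = sum((t - mean) ** 2 for t in times) / n

def gaussian_pdf(x, mu, sigma2):
    """PDF of N(mu, sigma2) -- the curve superimposed on the histogram."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The fitted curve peaks near the estimated mean, matching the histogram.
print(round(mean, 1), round(var))
```

Overlaying `gaussian_pdf(x, mean, var)` on a density histogram of `times` is exactly the visual check described above.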
So now I know that the x1 to xn, I'm going to say they're Gaussian, OK? I still need to specify two parameters. So what I want to know is, is the distribution the same from past years, right? So I want to know if the random variable that I'm looking for-- if I, say, pick one. Say, x1. Does it have the same distribution in 2012 that it did in 2009? OK.
And so, the question is: does x1 have a Gaussian distribution with mean 103.5 and variance 373? Is that clear? OK.
So this question that calls for a yes or no answer is a hypothesis testing problem. I am testing a hypothesis. And this is the basis of basically all of data-driven scientific inquiry. You just ask questions. You formulate a scientific hypothesis.
Knocking down this gene is going to cure melanoma -- is this true? I'm going to try it. I'm going to observe some patients on which I knock down this gene. I'm going to collect some measurements. And I'm going to try to answer this yes/no question, OK? It's different from the question, what is the mean running time for this year?
OK. So hypothesis testing is testing if this hypothesis is true. The hypothesis, in common English, as we just said, is: were runners running faster? All right? Anybody could formulate this hypothesis.
Now, you go to a statistician. And he's like, oh, what you're really asking me is: does x1 have a Gaussian distribution with mean less than 103.5 and variance 373? That's really the question that you ask in statistical terms. And so, if you're asking whether this was the same as before, there are many ways it could fail to be the same as before. There are basically three ways it could not be the same as before.
It could be the case that the expectation of x1 is no longer 103.5. So the expectation has changed. Or the variance has changed. Or the distribution has changed. I mean, who knows? Maybe runners are now all running holding hands, and it's now a point mass at one given point. OK. So you never know what could happen. Now of course, if you allow for any change, you will find change. And so what you have to do is factor in as much knowledge as you can. Make as many modeling assumptions as you can, so that you can let the data speak about your particular question.
Here, your particular question is, are they running faster? So you're only really asking a question about the expectation. You really want to know if the expectation has changed. So as far as you're concerned, you're happy to make the assumption that the rest has been unchanged. OK. And so, this is the question we're asking. Is the expectation now less than 103.5? Because you specifically asked whether runners were going faster this year, right? They tend to go faster rather than slower, all right? OK. So this is the question we're asking in mathematical terms.
So first, to do that, I need to basically fix the rest. And fixing the rest is actually part of the modeling assumptions. So I fixed my variance to be 373, OK? I assume that the variance has not changed between 2009 and 2012. Now, this is an assumption. It turns out it's wrong. So if you look at the data from 2012, this is not the correct assumption. But I'm just going to make it right now for the sake of argument, OK? And also the fact that it's Gaussian.
Now, this is going to be hard to violate, right? I mean, where did this bell-shaped curve come from? Well, it's just natural when you just measure a bunch of things. The central limit theorem appears in the small things of nature. I mean, that's the bedtime story you get about the central limit theorem. And that's why the bell-shaped curve is everywhere in nature. It's the sum of little independent things that are going on. And this Gaussian assumption, even if I wanted to relax it, there's not much else I can do. It is pretty robust across the years.
All right. So the only thing that we did not fix is the expectation of x1, which now I want to know what it is. And since I don't know what it is, I'm going to call it mu. And it's going to be a variable of interest, all right? So it's just a number mu. Whatever this is I can try to estimate it, maybe using maximum likelihood estimation. Probably using the average, because this is Gaussian. And we know that the maximum likelihood estimator for a Gaussian is just the average. And now we only want to test if mu is equal to 103.5, like it was in 2009. Or on the contrary, if mu is not equal to 103.5. And more specifically, if mu is actually strictly less than 103.5. That's the question you ask.
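As a sanity check on the claim that the average maximizes the Gaussian likelihood, here is a small sketch; the five sample values are made up, and the known variance 373 is taken from the lecture's model.

```python
import math

# Made-up running times for the sketch; the model is N(mu, 373) with mu unknown.
samples = [101.2, 99.8, 104.3, 100.9, 103.1]
xbar = sum(samples) / len(samples)

def log_likelihood(mu, sigma2=373.0):
    """Gaussian log-likelihood of the sample, viewed as a function of mu."""
    return sum(-(x - mu) ** 2 / (2 * sigma2)
               - 0.5 * math.log(2 * math.pi * sigma2) for x in samples)

# The sample average beats any other candidate value of mu we throw at it,
# consistent with the MLE for a Gaussian mean being the average.
best = max([xbar, 100.0, 103.5, 105.0], key=log_likelihood)
print(best == xbar)
```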
Now, why am I writing "mu equal to 103.5 versus mu less than 103.5," rather than "mu equal to 103.5 versus mu not equal to 103.5"? It's because, since you asked me a more precise question, I'm going to be able to give you a more precise answer. And so, if your question is very specific -- are they running faster? -- I'm going to factor that into what I write. If you just ask me, is it the same? I'm going to have to write, or is it different from 103.5? And that's less information about what you're looking for, OK?
So by making all these modeling assumptions-- the fact that the variance doesn't change, the fact that it's still Gaussian-- I've actually reduced the "number" of possibilities. And I put "number" in quotes, because there's still an infinite number of them. But I'm limiting the number of ways the hypothesis can be violated. The number of possible alternative realities for this hypothesis, all right?
For example, I'm saying there's no way mu can be larger than 103.5. I've already factored that in, OK? It could be. But I'm actually just going to say that if it's larger, all I'm going to be able to tell you is that it's not smaller. I'm not going to be able to tell you that it's actually larger, OK?
And the only way it can be rejected now-- the only way I can reject my hypothesis-- is if x1 belongs to a very specific family of distributions: if it has a distribution which is Gaussian with mean mu and variance 373, for mu less than 103.5. All right?
So what we started with basically was: x1 follows N(103.5, 373), OK? So that's the reality. And this is everything else, right? So for example, here x follows some exponential with parameter 0.1, OK? This is just another distribution here. Those are all the possible distributions.
What we said is we said, OK, first of all, let's just keep only those Gaussian distributions, right? And second, we said, well, among those Gaussian distributions, let's only look at those that have-- well, maybe this one should be at the boundary-- let's only look at the Gaussians here. So this guy here are all the Gaussians with mean mu and variance 373 for mu less than 103.5, OK?
So when you're going to give me data, I'm going to be able to say, well, am I this guy? Or am I one of those guys? Rather than searching through everything. And the more you search, the easier it is for you to find something that fits the data better, right? And so, if I allow everything possible, then there's going to be something that just by pure randomness is actually going to look better for the data, OK?
So for example, if I draw 10 random variables, right? If n is equal to 10. And let's say they take 10 different values. Then it's actually more likely that those guys come from a discrete distribution that takes each of these values with probability 1 over 10 than from some Gaussian random variable, right? That would be a perfect fit. I can actually explain every observation exactly.
Say the numbers I collect are 91, 95, and 102. Then the most likely distribution for those guys is the discrete distribution that takes three values: 91 with probability 1/3, 95 with probability 1/3, and 102 with probability 1/3, right? That's definitely the most likely distribution for this sample. So if I allowed this, I would say, oh no, this is not distributed according to that Gaussian. It's distributed according to this very specific distribution, which is somewhere in the realm of all possible distributions, OK?
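The point that an unrestricted family always "wins" can be illustrated numerically. This sketch compares the likelihood of the three observations under the empirical distribution (mass 1/3 on each value) with the product of N(103.5, 373) density values; comparing a discrete mass to a density is informal, but it is exactly the spirit of the lecture's argument.

```python
import math

data = [91.0, 95.0, 102.0]  # the three observed values from the example

# Likelihood under the empirical (discrete) distribution: each observed
# value gets mass 1/3, so the sample has likelihood (1/3)**3.
empirical_lik = (1.0 / 3.0) ** len(data)

def gaussian_pdf(x, mu=103.5, sigma2=373.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Likelihood under the hypothesized N(103.5, 373): product of densities.
gaussian_lik = math.prod(gaussian_pdf(x) for x in data)

# The discrete "fit" always looks better -- which is exactly why we restrict
# the family of candidate distributions before looking at the data.
print(empirical_lik > gaussian_lik)
```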
So now we're just going to try to carve out all this stuff by making our assumptions. OK. So here in this particular example, just make a mental note that a little birdie told me that the reference number is 103.5, OK? That was the thing I'm actually looking for. In practice, it's actually seldom the case that you have this reference number to compare to, right? Maybe here, I just happen to have a full data set of all the runners of 2009. But if I really just asked you, were runners faster in 2012 than in 2009? Here's $10 to perform your statistical analysis. What you're probably going to do is call maybe 10 runners from 2012, maybe 15 runners from 2009, ask them, and try to compare the means. There's no standard reference. You would not be able to come up with this 103.5, because this data may be expensive to get or something.
OK. So this is really more of the standard case, all right? Where you really compare two things with each other, but there's no actual ground truth number that you're comparing it to. OK. So we'll come back to that in a second. I'll tell you what the other example looks like.
So let's just stick to this example. I tell you it's 103.5, OK? Let's try to have our intuition work the same way. We said, well, averages worked well. The average over these 10 guys should tell me what the mean is. So I can just say, well, x bar is going to be close to the true mean by the law of large numbers. So I'm going to check whether x bar is less than 103.5, and conclude that in this case, indeed, mu is less than 103.5, because those two quantities are close, right? I could do that.
The problem is that this could go pretty wrong. Because if n is small, then I know that xn bar is not equal to mu. I know that xn bar is close to mu. But I also know that there's a pretty high chance that it's not equal to mu. In particular, I know it's going to be somewhere at 1 over root n away from mu, right? 1 over root n being the rate coming from what? The CLT, right? That's the root n that comes from the CLT. In blunt words, the CLT tells me the average is at distance about 1 over root n from the expectation, pretty much. That's what it's telling me.
So 1 over root n. If I have 10 people in there, 1 over root 10 is not a small number, right? It's like 1/3, pretty much. So about 1/3 away from 103.5. If the true mean was actually 103.4, but my average was telling me 103.4 plus 1/3, I would actually come to the wrong conclusion, right?
So let's say that mu is equal to 103.4, OK? So you're not supposed to know this, right? That's the hidden truth. OK.
Now I have n is equal to 10. So I know that x bar n minus 103.4 is something of the order of 1 over the square root of 10, which is of the order of, say, 0.3. OK. So here, this is all hand wavy, OK? But that's what the central limit theorem tells me.
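This back-of-the-envelope claim that the average fluctuates at scale 1 over root n can be checked by simulation; the seed, the 2,000 repetitions, and sigma = 1 are arbitrary choices for the sketch (the lecture's hand-wavy "0.3" implicitly takes sigma around 1).

```python
import math
import random

# Simulate many sample averages of n = 10 draws from N(mu, sigma^2)
# and measure how far they typically land from mu.
random.seed(1)
mu, sigma, n = 103.4, 1.0, 10
averages = [sum(random.gauss(mu, sigma) for _ in range(n)) / n
            for _ in range(2000)]

# Root-mean-square deviation of the averages from mu:
spread = math.sqrt(sum((a - mu) ** 2 for a in averages) / len(averages))
print(round(spread, 2))   # close to sigma/sqrt(10), i.e. about 0.32
```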
What it means is that it is possible that x bar n is actually equal to 103.4 plus 0.3, which is equal to 103.7. Which means that while the truth is that mu is less than 103.5, I would conclude that mu is larger than 103.5, OK? And that's because I have not been very cautious, OK?
So what we want to do is to have a little buffer to account for the fact that xn bar is not a precise value for the true mu. It's something that's of order 1 over root n away from mu. And so, what we want is a better heuristic that says: well, if I want to conclude that mu is less than 103.5, maybe I need x bar n to be less than 103.5 minus a little buffer that goes to 0 as my sample size goes to infinity. And the central limit theorem actually tells me what this buffer should be: something that goes to 0 as n goes to infinity at the rate 1 over root n, right? That's basically what the central limit theorem tells me.
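One way to write down this buffered heuristic is the following sketch. The cutoff z = 1.645 (a standard Gaussian quantile) is my own placeholder, since the lecture has not yet said how to choose the buffer; the point is only its 1-over-root-n shape.

```python
import math

def reject_faster(xbar, n, mu0=103.5, sigma2=373.0, z=1.645):
    """Heuristic test: conclude mu < mu0 only if the sample average sits
    below mu0 by a buffer of order 1/sqrt(n), here z * sigma / sqrt(n).
    The choice z = 1.645 is an illustrative Gaussian quantile, not yet
    justified at this point in the lecture."""
    buffer = z * math.sqrt(sigma2) / math.sqrt(n)
    return xbar < mu0 - buffer

# With n = 10, the buffer is about 10 minutes, so an average of 103.4
# is NOT enough evidence that runners got faster.
print(reject_faster(103.4, n=10))
# A much smaller average, or a much larger sample, is needed to conclude.
print(reject_faster(90.0, n=10))
```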
So to make this intuition more precise, we need to understand those fluctuations. We need to actually put in something that's more precise than these little wiggles here, OK? We need to actually have the central limit theorem come in.
So here is the example of comparing two groups. So pharmaceutical companies use hypothesis testing to test if a drug is effective, right? That's what they do. They want to know, does my new drug work? And that's what the Food and Drug Administration is doing on a daily basis. They ask for extremely well regulated clinical trials on a thousand people, and check: does this drug make a difference? Did everybody die? Does it make no difference? Should people pay $200 for a pill of sugar, right? So that's what people are actually asking.
So to do so, of course, there is no ground truth about-- so there's actually a placebo effect. So it's not like actually giving a drug that does not work is going to have no effect on patients. It will have a small effect, but it's very hard to quantify. We know that it's there, but we don't know what it is. And so rather than saying, oh the ground truth is no improvement, the ground truth is the placebo effect. And we need to measure what the placebo effect is.
So what we're going to do is we're going to split our patients into two groups. And there's going to be what's called a test group and a control group. So the word test here is used in a different way than hypothesis testing. So we'll just call it typically the drug group. And so, I will refer to mu drug for this guy, OK?
Now, let's say this is a cough syrup, OK? And when you have a cough syrup, the way you measure its efficacy is to measure how many times you cough per hour, OK? And so, I define mu control as the expected number of expectorations per hour. So just the expected number, right? This is a number I don't know, because I don't have access to the entire population of people that will ever take this cough syrup.
And so, I will call it mu control for the control group. So those are the people who have been actually given just like sugar, like maple syrup. And mu drug are those people who are given the actual syrup, OK? And you can imagine that maybe maple syrup will have an effect on expectorations per hour just because, well, it's just sweet and it helps, OK? And so, we don't know what this effect is going to be. We just want to measure if the drug is actually having just a better impact on expectorations per hour than the just pure maple syrup, OK?
So what we want to know is if mu drug is less than mu control. That would be enough. If we had access to all the populations that will ever take the syrup for all ages, then we would just measure, did this have an impact? And even if it's a slightly ever so small impact, then it's good to release this cough syrup, assuming that it has no side effects or anything like this, because it's just better than maple syrup, OK? The problem is that we don't have access to this. And we're going to have to make this decision based on samples that give me imprecise knowledge about mu drug and mu control.
So in this case, unlike the first case, where we compared an unknown expected value to a fixed number, which was the 103.5, here we're just comparing two unknown numbers with each other, OK? So there are two sources of randomness: trying to estimate the first one, and trying to estimate the second one.
Before I move on, I just wanted to tell you I apologize. One of the graders was not able to finish grading his problem sets for today. So for those of you who are here just to pick up their homework, feel free to leave now. Even if you have a name tag, I will pretend I did not read it. OK. So I'm sorry. You'll get it on Tuesday. And this will not happen again. OK.
So for the clinical trials, now I'm going to collect information. I'm going to collect the data from the control group. And I'm going to collect data from the test group, all right?
So my control group here. I don't have to collect the same number of people in the control group as in the drug group. Actually, for cough syrup, maybe it's not that important. But you can imagine that if you think you have the cure to a really annoying disease, it's actually hard to tell half of the people, you will get a pill of nothing, OK? People tend to want to try the drug. They're desperate. And so, you have this sort of imbalance between who is getting the drug and who's not getting the drug.
And people have to qualify for the clinical trial. There are lots of fluctuations that affect what the final numbers of people who actually get the drug and who get the control are going to be. And so, it's not easy for you to make those two numbers equal. You'd like to have those numbers equal if you can, but it's not necessary. And by the way, this is all part of some mystical science called "design of experiments." And in particular, you can imagine that if one of the series had higher variance, you would want more people in that group than in the other group. Yeah?
STUDENT: So when we're subtracting [INAUDIBLE] something that [INAUDIBLE] 0 [INAUDIBLE] to be satisfied. So that's on purpose [INAUDIBLE].
PROFESSOR: Yeah, that's on purpose. And I'll come to that in a second, all right? So basically, we're going to make it if your answer is, is this true? We're going to make it as hard as possible, but no harder for you to say yes to this answer. Because, well, we'll see why.
OK, so now we have two sets of data, the x's and the y's. The x's are the ones for the drug. And the y's are the data that I collected from the people who were just given a placebo, OK? And so, they're all IID random variables. And here, since it's a number of expectorations, I'm making a blunt modeling assumption. I'm just going to say it's Poisson. And it's characterized only by the mean mu drug or the mean mu control, OK? I've just made an assumption here. It could be something different. But let's say it's a Poisson distribution.
So now what I want to do is to test whether mu drug is less than mu control. We said that already. But the way we said it before was not as mathematical as it is now. Now we're actually making a test on the parameters of a Poisson distribution. Whereas before, we were just making a test on expected values, OK?
So the heuristic-- again, let's try to apply the heuristic now. Rather than comparing x bar drug to some fixed number, I'm actually comparing x bar drug to x bar control. But now here, I need to have something that accounts not only for the fluctuations of x bar drug, but also for the fluctuations of x bar control, OK? And so, now I need something that goes to 0 when both of those sample sizes go to infinity. And typically, it should go to zero like 1 over the square root of n drug plus 1 over the square root of n control, OK? That's what the central limit theorem for both x bar drug and x bar control-- two central limit theorems-- is actually telling us. OK. And then we can conclude that this happens.
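The two-sample version of the heuristic might look like the sketch below. The z cutoff, the plug-in of the sample means as Poisson variance estimates, and all the numeric values are illustrative assumptions, not the lecture's final construction.

```python
import math

def drug_better(xbar_drug, n_drug, xbar_control, n_control, z=1.645):
    """Two-sample heuristic: conclude mu_drug < mu_control only if the gap
    exceeds a buffer shrinking like 1/sqrt(n_drug) and 1/sqrt(n_control).
    For Poisson data the variance equals the mean, so the sample averages
    are plugged in as variance estimates (an assumption justified, for
    large samples, by the CLT). z = 1.645 is an illustrative cutoff."""
    se = math.sqrt(xbar_drug / n_drug + xbar_control / n_control)
    return xbar_drug - xbar_control < -z * se

# Hypothetical expectoration counts per hour in the two groups:
print(drug_better(8.1, n_drug=200, xbar_control=10.3, n_control=150))
```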
And as you said, we're trying to make it a bit harder to conclude this. Because let's face it, we could be fooled if we used too simple a heuristic, right? So for simplicity, I can rewrite "x bar drug less than x bar control minus this something that goes to 0" as "x bar drug minus x bar control less than something negative," OK? Minus this little something, OK?
So now let's look at those guys. This is the difference of two random variables. From the central limit theorem, they should each be approximately Gaussian. And actually, we're going to think of them as being independent. There's no reason why the people in the control group should have any effect on what's happening to the people in the test group. Those people probably don't even know each other. And so, when I look at this difference, it should look like a Gaussian with some mean and some variance; let's say I don't know what they are, OK?
The mean I actually know. It's mu drug minus mu control, OK? So if I were to plot the PDF of this guy, it would look like this. I would have something which is centered at mu drug minus mu control. And it would look like this, OK?
Now, let's say that mu drug is actually equal to mu control. That this pharmaceutical company is a huge scam, and they really are trying to sell bottled corn syrup for $200 a pop, OK? So this is a huge scam. And the difference of the true means is actually equal to 0. So this thing is really centered about 0, OK?
Now, if I were not to do this, then basically, half of the time I would come up with a difference that's above this value. And half of the time I would have something that's below this value, which would mean that half of the scams would actually go through the FDA if I did not do this. So what I'm trying to do is to say: well, OK, you have to be way out here, so that there is actually a very low probability that just by chance you end up being here. And we'll make all these statements extremely precise later on.
But I think the drug example makes it interesting to see why you're making it hard, because you don't want to allow people to sell a thing like that. Before we go more into the statistical thinking associated with tests, let's just see how we would do this quantification, right? I mean, after all, this is what we are probably most comfortable with at this point. So let's just try to understand this.
And I'm going to make the statistician's favorite test, which is the thing that obviously you do at home every time you get a new quarter: testing whether it's a fair coin or not. All right? So this test, of course, exists only in textbooks. And I actually did not write this slide. I was too lazy to replace all this stuff with the Cherry Blossom Run.
So you have a coin. Now you have 80 observations, x1 to x80. So n is equal to 80. I have x1 to xn, IID Bernoulli p. And I want to know if I have a fair coin. So in mathematical language, I want to know if p is equal to 1/2.
Let's say this is just the heads, OK? And a biased coin? Well, maybe you would potentially be interested whether it's biased one direction or the other. But not being a fair coin is already somewhat of a discovery, OK? And so, you just want to know whether p is equal to 1/2 or p is not equal to 1/2, OK?
Now, if I were to apply the very naive first heuristic, to not reject this hypothesis when I flip this coin 80 times, I would need to see exactly 40 heads and 40 tails. Now, this is very unlikely to happen exactly. You're going to have close to 40 heads and close to 40 tails, but how close should those things be? OK? And so, the little something is going to be quantified by exactly this, OK?
So now here, let's say that my experiment gave me 54 heads. That's 54? Yeah. Which means that my xn bar is 54 over 80, which is about 0.68. All right? So I have this estimator. Looks pretty large, right? It's much larger than 0.5, so it does look like-- and my mom would certainly conclude-- that this is a biased coin for sure, because she thinks I'm tricky. All right.
So the question is, can this be due to chance? Can this be due to chance alone? Like, what is the likelihood that a fair coin would actually end up on heads 54 times rather than 40? OK? And so, what we do is we say, OK, I need to understand, what is the distribution of the number of times it comes up heads? And this is going to be a binomial, but it's a little annoying to play with. So we're going to use the central limit theorem, which tells me that square root of n times xn bar minus p, divided by the square root of p times 1 minus p, is approximately distributed as an N(0, 1). And here, since n is equal to 80, I'm pretty safe that this is actually going to work.
And I can actually use Slutsky, and put xn bar here. Slutsky tells me that this is OK to do. All right.
So now I'm actually going to compute this. So here, I know this. This is square root of 80. This is 0.68. What is this value here? Well, we're trying to understand what happens if it is a fair coin, right? So if it's fair, then p is equal to 0.5, right? So what I want to know is, what is the likelihood that a fair coin would actually give me 0.68? Let me finish.
All right. What is the likelihood that a fair coin would allow me to do this? So I'm allowed to plug in p equal to 0.5 here. Now, your question is, why do I not plug in p equal to 0.5 in the other spot as well? But you can. All right. I just want to plug in p at one specific point, but you're absolutely right.
OK. Let's forget about your question for one second. So now I'm going to look at square root of n times xn bar minus 0.5, divided by the square root of xn bar times 1 minus xn bar. Then this thing is approximately N(0, 1) if the coin is fair. Otherwise, I'm going to have a mean which is not zero here. If the coin is something else, whatever I get here, right? Let's just write it for one second.
Let's do it. So what is the distribution of this guy if p is not 0.5? So that was for p equal to 0.5, OK? Now if p is equal to 0.6, then I know that this is equal to square root of n times xn bar minus 0.6, divided by the square root of xn bar times 1 minus xn bar, plus-- well, now the difference-- square root of n times 0.6 minus 0.5, divided by the square root of xn bar times 1 minus xn bar, right? Now if p is equal to 0.6, then the first term is N(0, 1), but the second term is something different. It's just a number that grows with square root of n. It's actually pretty large.
So if I want to use the fact that this guy has a normal distribution, I need to plug in the true value here. Now, the implicit question that I got was the following. It says, well, if you know what p is, then what's actually true is also this: if p is equal to 0.5, then since I know that square root of n times xn bar minus p, divided by the square root of p times 1 minus p, is an N(0, 1), it's also true that square root of n times xn bar minus 0.5, divided by the square root of 0.5 times 1 minus 0.5, is N(0, 1), right? I know what p is. I'm just going to make it appear.
OK. And so, what's actually nice about this particular [INAUDIBLE] experiment is that I can check if my assumption is valid. So what I'm going to do right now is check whether this is likely to be a Gaussian or not, right? And there are two ways I can violate it: by violating the mean, but also by violating the variance. And here, what I did in the first case, I said, well, I'm not allowing you to check whether you've violated the variance. I'm just plugging in whatever variance you're getting. Whereas here, I'm saying, well, there are two ways you can violate it. And I'm just going to factor everything in.
So now I can plug-in this number. So this is 80. This is 0.68. So I can compute all this stuff. I can compute all this stuff here as well. And what I get in this case, if I plug in the xn bar, I get 3.45, OK?
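(As a quick sanity check, the number on the board can be reproduced in a few lines of Python; the values n = 80 and xn bar = 0.68 are the ones quoted in the lecture.)

```python
from math import sqrt

n = 80       # number of coin tosses
xbar = 0.68  # observed proportion of heads
p0 = 0.5     # value of p under the "fair coin" hypothesis

# Standardized statistic with the empirical variance plugged in:
# sqrt(n) * (xbar - p0) / sqrt(xbar * (1 - xbar))
z = sqrt(n) * (xbar - p0) / sqrt(xbar * (1 - xbar))
print(round(z, 2))  # -> 3.45
```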
And now I claim that this makes it reasonable to reject the hypothesis that p is equal to 0.5. Can somebody tell me why?
STUDENT: It's pretty big.
PROFESSOR: Yeah, 3 is pretty big. So it's very unlikely. So this number that I should see should look like the number I would get if I asked a computer to draw one random Gaussian for me. When I draw one random Gaussian, with 99.7% probability this number will be between negative 3 and 3. With 95% it's going to be between negative 2 and 2. And with 68% it's between minus 1 and 1.
So getting a 3.45 when you do this is extremely unlikely to happen, which means that you would have to be extremely unlucky for this to ever happen. Now, it can happen, right? It could be the case that you flip 80 coins and 80 of them are heads. With what probability does this happen? 1 over 2 to the 80, right? You're probably better off playing the lottery with these kinds of odds, right? I mean, this is just not going to happen, but it might happen.
So we cannot remove completely the uncertainty, right? It's still possible that this is due to noise. But we're just trying to make all the cases that are very unlikely go away, OK? And so, now I claim that 3.45 is very unlikely for a Gaussian. So if I were to draw the PDF of a standard Gaussian, right? So N(0, 1), right? So that's the PDF of N(0, 1). 3.45 is basically here, OK? So it's just too far in the tails. Understood?
Now I cannot say that the probability that the Gaussian is equal to 3.45 is small, right? I just cannot say that, because it's 0. And it's also 0 for the probability that it's 0, even though the most likely values are around 0. It's a continuous random variable. Any value you give me, it's going to happen with probability zero.
So what we're going to say is, well, the fluctuations are larger than this number. The probability that I get anything worse than this is actually extremely small, right? Anything worse than this is just like farther than 3.45. And this is going to be what we control, all right? So in this case, I claim that it's quite reasonable to reject the hypothesis.
Is everybody OK with this? Everybody find this shocking? Or everybody has no idea what's going on? Do you have any questions? Yeah?
STUDENT: Regarding the case where p and 1 minus p aren't close to xn bar. If you use 1 minus p as 0.5, then you're dividing by a larger number than you would if you used xn bar. So it feels like our true number is not 3.45. It's something a little bit smaller than 3.45 for the distribution to actually be like 1/2. Because it seems like we're adding an unnecessary extra error by using xn bar. And we're adding an error that makes it seem that our result was less likely than it actually was.
PROFESSOR: That's correct. And you're right. I didn't want to plug-in the p everywhere, but you should plug it in everywhere you can. That's for sure, OK? So let's agree on that. And it's true that it makes the number a little bigger. Let's compute how much we would get if we put 0.5 there. Well, I don't know what the square root of 80 is. Can somebody compute quickly? I'm not asking you to do it. But what I want is two times square root of 80 times 0.18. 3.22.
OK. I can make the same cartoon picture with 3.22. But you're right. This is definitely more accurate. And I should have done this. I didn't want the message to get confused, OK?
All right. So now here's a second example that you can think of. So now I toss it 30 times. Still in the realm of the central limit theorem. I get 13 heads rather than 15. So I'm actually much closer to being exactly at half. So let's see if this is actually going to give me a plausible value.
So I get about 0.43 on average. If the truth was 0.5, I would get something like 0.77. And now I claim that 0.77 is a plausible realization for some standard Gaussian, OK? Now, 0.77 is going to look like it's here. So that could very well be something that just comes because of randomness.
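(Again as a sketch, the second example can be checked numerically. Note that the 0.77 quoted in the lecture comes from rounding xn bar to 0.43 before standardizing; the exact value 13/30 gives about 0.74, on the negative side, which doesn't change the conclusion.)

```python
from math import sqrt

n = 30
heads = 13
xbar = heads / n  # about 0.433

# Same standardized statistic as before, with p0 = 0.5
z = sqrt(n) * (xbar - 0.5) / sqrt(xbar * (1 - xbar))
print(round(z, 2))  # -> -0.74, well inside the plausible range for N(0, 1)
```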
And again, if you think about it. If I told you, you were expecting 15, you saw 13, you're happy to put that on the account of randomness. Now of course, the question is going to be, where do I draw the line? Right? Is 12 the right number? Is 11? Is 10? What is it?
So basically, the answer is it's whatever you want it to be. The problem is it's hard to think on this scale, right? What does it mean to think on this scale? If I can think on this scale for 30 coin flips, I'm going to have to think on the scale of 80 of them. I'm going to have to think on the scale of running 100 coin flips. And so, this scale is a moving target all the time. Every time you have a new problem, you have to have a new scale in mind. And it's very difficult.
The purpose of statistical analysis, and in particular of this process that takes your xn bar and turns it into something that should be standard Gaussian, is that it allows you to map the value of xn bar into a scale that is the standard scale of the Gaussian, all right? Now, all you need to have in mind is, what is a large number or an unusually large number for a Gaussian? That's all you need to know.
So here, by the way, 0.77 is not this one, because it was actually negative 0.77. So this one, OK. So I can be on the right or I can be on the left of zero. But they are still plausible. So the point is, you could actually have in mind all the values that are plausible for a Gaussian and those that are not plausible, and draw the line based on what you think is the right number. So how large does a positive value of a Gaussian have to be to become unreasonable for you? Is it 1? Is it 1.5? Is it 2? Stop me when I get there. Is it 2.5? Is it 3?
STUDENT: I think 2.5 is definitely too big.
STUDENT: Doesn't it depend on our prior? Let's say we already have really good evidence at this point [INAUDIBLE]
PROFESSOR: Yeah, so this is not Bayesian statistics. So there's no such thing as a prior right now. We'll get there. You'll have your moment during one short chapter. So there's no prior here, right? It's really a matter of whether you think a Gaussian is large or not. It's not a matter of coins. It's not a matter of anything.
Now I've just reduced it to just one question. So forget about everything we just said. And I'm asking you, when do you decide that a number is too large to be reasonably drawn from a Gaussian? And this number is 2, or 1.96. And that's basically the number that you get from this quantile. We've seen the 1.96 before, right? It's actually q alpha over 2, where alpha is equal to 5%. That's a quantile of the Gaussian.
So actually, what we do is we map it again. So we are now on the Gaussian scale. And then we map it again into some probabilities, which is the probability of being farther than this thing. And with probabilities, we can think. Probability is something that quantifies my error. And the question is what percentage of error am I willing to tolerate.
And if I tell you 5%, that's something you can really envision. What it means is that if I were to do this test a million times, 5% of the time I would expose myself to making a mistake. All right. That's all it would say. If you said, well, I don't want to account for 5%, maybe I want 1%, then you have to move from 1.96 to about 2.58. And then if you say I want 0.01%, then you have to move to an even larger number. So it depends.
But stating this number 1%, 5%, 10% is much easier than stating those numbers 1.96, 2.58, et cetera. So we're just putting everything back on a scale we can think about. All right.
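(The dictionary between error levels and Gaussian thresholds is just the quantile function. Python's standard library can produce it; a quick sketch.)

```python
from statistics import NormalDist

std = NormalDist()  # standard Gaussian N(0, 1)

# q_{alpha/2}: the two-sided threshold at level alpha.
# Reject when the standardized statistic exceeds it in absolute value.
for alpha in (0.05, 0.01):
    c = std.inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha:.0%}: reject when |Z| > {c:.2f}")
```

This prints the familiar 1.96 for 5% and about 2.58 for 1%.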
To conclude, this, again, as we said, does not suggest that the coin is unfair. Now, it might be that the coin is unfair. We just don't have enough evidence to say that. And that goes back to your question about, why are we siding with the fact that we're making it harder to conclude that the runners were faster? And this is the same thing. We're making it harder to conclude that the coin is biased. Because there is a status quo. And we're trying to see if we have evidence against the status quo. The status quo for the runners is they ran the same speed. The status quo for the coin, we can probably all agree is that the coin is fair.
The status quo for a drug? I mean, unless you prove to me that you're actually not a scammer, the status quo is that this is maple syrup. There's nothing in there. Why would you? I mean, if I let you get away with it, you would put corn syrup. It's cheaper. OK.
So now let's move on to math. All right. So now that we're starting to do mathematics, I'm going to have to talk about random variables and statistical models. And here, there is actually a very simple thing, which actually goes back to this picture. A test is really asking me if my parameter is in some region of the parameter set or another region of the parameter set, right? Yes/no.
And so, what I'm going to be given is a sample, x1 to xn. I have a model. And again, those can be braces depending on the day. And so, now I'm going to give myself theta 0 and theta 1, two disjoint subsets. OK. So capital theta here is the space in which my parameter can live.
To make two disjoint subsets, I could just split this guy in half, right? I'm going to say, well, maybe it's this guy and this guy. OK. So this is theta 0. And this is theta 1.
What it means when I split those two guys, in test, I'm actually going to focus only on theta 0 or theta 1. And so, it means that a priori I've already removed all the possibilities of theta being in this region. What does it mean? Go back to the example of runners.
This region here for the Cherry Blossom Run is the set of parameters, where mu was larger than 103.5, right? We removed that. We didn't even consider this possibility. We said either it's less-- sorry. That's mu equal to 103.5. And this was mu less than 103.5, OK?
But these guys were like if it happens, it happens. I'm not making any statement about that case. All right? So now I take those two subsets. And now I'm going to give them two different names, because they're going to have an asymmetric role.
h0 is the null hypothesis. And h1 is the alternative hypothesis. h0 is the status quo. h1 is what is considered typically as scientific discovery.
So if you're a regulator, you're going to push towards h0. If you're a scientist, you're going to push towards h1. If you're a pharmaceutical company, you're going to push towards h1. OK?
And so, depending on whether you want to be conservative-- oh, I can find evidence in a lot of data. As soon as you give me three data points, I'm going to be able to find evidence. That means I'm going to tend to say, oh, it's h1. But if you say you need a lot of data before you can actually move away from the status quo, that's h0, OK? So think of h0 as being the status quo, h1 being some discovery that goes against the status quo. All right?
So if we believe that the truth theta is either in one of those, what we say is we want to test h0 against h1. OK. This is actually wording. So remember, because this is how your questions are going to be formulated. And this is how you want to probably communicate as a statistician. So you're going to say I have the null and I have an alternative. I want to test h0 against h1. I want to test the null hypothesis against the alternative hypothesis, OK?
Now, the two hypotheses I forgot to say are actually this. h0 is that theta belongs to theta 0. And h1 is that theta belongs to theta 1. OK. So here, for example, theta was mu. And that was mu equal to 103.5. And this was mu less than 103.5, OK? So typically, they're not going to look like thetas and things like that. They're going to look like very simple things, where you take your usual notation for your usual parameter and you just say in mathematical terms what relationship it should be satisfying, right?
For example, in the drug example, that would be mu drug is equal to mu control. And here, that would be mu drug less than mu control. The number of expectorations for people who take the drug for the cough syrup is less than the number of expectoration of people who take the corn syrup, OK?
So now what we want to do. We've set up our hypothesis testing problem. You're a scientist. You've set up your problem. Now what you're going to do is collect data. And what you're going to try to find on this data is evidence against h0. And the alternative is going to guide you into which direction you should be looking for evidence against this guy. All right?
And so, of course, the narrower the alternative, the easier it is for you, because you just have to look at the one possible candidate, right? But typically, h1 is a big group, like less than. Nobody tells you it's either 103.5 or 103. People tell you it's either 103.5 or less than 103.5. OK. And so, what we want to do is to decide whether we reject h0. So we look for evidence against h0 in the data, OK?
So as I said, h0 and h1 do not play a symmetric role. It's very important to know which one you're going to place as h0 and which one you're going to place as h1. If it's a close call, you're always going to side with h0, OK? So you have to be careful about those. You have to keep in mind that if it's a close call, if the data does not carry a lot of evidence, you're going to side with h0. And so, you're actually never saying that h0 is true. You're just saying I did not find evidence against h0. You don't say I accept h0. You say I fail to reject h0. OK.
And so one of the things that you want to keep in mind when you're doing this is innocent until proven guilty. So if you come from a country like America, there's such a thing. And in particular, lack of evidence does not mean that you are not guilty, all right? OJ Simpson was found not guilty. He was not found innocent, OK?
And so, this is basically what happens. It's like the prosecutor brings their evidence. And then the jury has to decide whether they were convinced that this person was guilty of anything. And the question is, do you have enough evidence? But if you don't have evidence, it's not the burden of the defendant to prove that they're innocent. Nobody's proving they're innocent. I mean, sometimes it helps. But you just have to make sure that there's not enough evidence against you, OK? And that's basically what it's doing. You're h0 until proven h1.
So how are we going to do this? Well, as I said, the role of estimators in hypothesis testing is played by something called tests. And a test is a statistic. Can somebody remind me what a statistic is? Yep?
STUDENT: The measure [INAUDIBLE]
PROFESSOR: Yeah, that's actually just one step more. So it's a function of the observations. And we require it to be measurable. And as a rule of thumb, measurable means if I give you data, you can actually compute it, OK? If you don't see a [INAUDIBLE] or an [INAUDIBLE], you don't have to think about it. All right.
And so, what we do is we just have this test. But now I'm actually asking only from this test a yes/no answer, which I can code as 0, 1, right? So as a rule of thumb, you say that, well, if the test is equal to 0, then h0. If the test is equal to 1, then h1. And as we said, if the test is equal to 0, it doesn't mean that h0 is true. It means that I fail to reject h0. And if the test is equal to 1, I reject h0.
So I have two possibilities. I look at my data. I turn it into a yes/no answer. And yes/no answer is really h0 or h1, OK? Which one is the most likely basically. All right.
So in the coin flip example, our test statistic is actually something that takes value 0, 1. And anything, any function that takes value at 0, 1 is an indicator function, OK? So an indicator function is just a function. So there's many ways you can write it. So it's a 1 with a double bar. If you aren't comfortable with this, it's totally OK to write i of something, like i of a. OK. And that's what? So a, here, is a statement, like an inequality, an equality, some mathematical statement, OK? Or not mathematical. I mean, "a" can be, you know, my grandma is 20 years old, OK? And so, this is basically 1 if a is true, and 0 if a is false. That's the way you want to think about it.
This function takes only two values, and that's it. So here's the example that we had. We looked at whether the standardized xn bar, the one that actually is approximately N(0, 1), was larger than something in absolute value, either very large or very small, but negative. I'm going back to this picture. We wanted to know if this guy was either to the left of something or to the right of something, right? Was it in these regions?
Now this indicator, I can view this as a function of x bar. What it does, it really splits the possible values of x bar, which is just a real number, right? In two groups. The groups on which they lead to a value, which is 1. And the groups on which they lead to value, which is 0, right?
So what it does is that I can actually think of it as the real line, x bar. And there's basically some values here, where I'm going to get a 1. Maybe I'm going to get a 0 here. Maybe I'm going to get a 0. Maybe I'm going to get a 1. I'm just splitting all possible values of x bar. And I see whether to spit out the side which is 0 or which is 1.
In this case, it's not clear, right? I mean, the function is very nonlinear. It's x bar minus 0.5 divided by the square root of x bar 1 minus x bar. If we put the p in the denominator, that would be clear. That would just be exactly something that looks like this. The function would be like this. It would be 1 if it's smaller than some value. 0 if it's in between two values. And then 1 again. So that's psi, OK?
So this is 1, right? This is 1. And this is 0. So if x bar is too small or if x bar is too large, then I'm getting a value 1. But if it's somewhere in between, I'm getting a value 0. Now, if I have this weird function, it's not clear how this happened.
So the picture here that I get is that I have a weird non-linear function, right? So that's x bar. That's square root of n times xn bar minus 0.5, divided by the square root of xn bar 1 minus xn bar, right? That's this function. A priori, I have no idea what this function looks like. We can probably analyze this function, but let's pretend we don't know. So it's like some crazy stuff like this.
And all I'm asking is whether in absolute value it's larger than c, which means that is this function larger than c or less than minus c? The intervals on which I'm going to say 1 are this guy, this guy, this guy, and this guy. OK. And everywhere else, I'm seeing 0. Everybody agree with this? This is what I'm doing.
Now of course, it's probably easier for you to just package it into this nice thing: is it larger than c in absolute value, or less than c? I don't want to have to plot this function. In practice, you don't have to.
Now, this is where I am actually claiming-- so here, I actually defined for you a test. And I opened this lecture by saying, oh, now we're going to do something better than computing averages. Now I'm telling you it's just computing an average. But the thing is, the test is not just the specification of this x bar. It's also the specification of this constant c, all right? And the constant c is exactly where our belief about what a large value for a Gaussian is came in. So this choice of c is basically a threshold. Above this threshold we decide this isn't likely to come from a Gaussian. Below this threshold we decide that it's likely to come from a Gaussian. So we have to choose what this threshold is based on what we think likely means.
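(Putting the two ingredients together, the statistic and the threshold c, the coin test can be sketched as one small function. The name fair_coin_test is just an illustration, not notation from the lecture.)

```python
from math import sqrt

def fair_coin_test(flips, c=1.96):
    """The test psi: return 1 (reject 'p = 0.5') or 0 (fail to reject).

    Assumes 0 < xbar < 1 so the empirical variance is nonzero.
    """
    n = len(flips)
    xbar = sum(flips) / n
    z = sqrt(n) * (xbar - 0.5) / sqrt(xbar * (1 - xbar))
    return 1 if abs(z) > c else 0

# 54 heads out of 80 (xbar = 0.675): the statistic is far in the tail
print(fair_coin_test([1] * 54 + [0] * 26))  # -> 1, reject
# 13 heads out of 30: a plausible fluctuation for a fair coin
print(fair_coin_test([1] * 13 + [0] * 17))  # -> 0, fail to reject
```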
Just a little bit more of those things. So now we're going to have to characterize what makes a good test, right? Well, I'll come back to it in a second. But you could have a test that says reject all the time. And that's going to be a bad test, right? The FDA is not implementing a test that says, yes, all drugs work, now let's just go to Aruba, OK?
So people are trying to have something that tries to work all the time. Now FDA's not either saying, let's just say that no drugs work, and let's go to Aruba, all right? They're just trying to say the right thing as often as possible. And so, we're going to have to measure this.
So the things that are associated to a test are the rejection region. And if you look at the x in E to the n such that psi of x is equal to 1, this is exactly this guy that I drew. So here, I summarized the values of the sample into their average. But it's the values of the sample that lead to a test that says 1, all right? So this is the rejection region.
If I collect a data point, technically I have-- so I have e to the n, which is a big space like this. So that's e to the n. Think of it as being the space of xn bars. And I have a function that takes only value 0, 1. So I can decompose it into this part where it takes value 0 and the part where it takes value 1. And those can be super complicated, right?
I can have a thing like this. I can have some weird little islands where it takes value 1. I can have some islands where it takes value 0. I can have some weird stuff going on. But I can always partition it into the part where it takes value 0 and the part where it takes value 1. And the part where it takes value 1, where psi is equal to 1, is called the rejection region of the test, OK? So it's just the samples that would lead me to rejecting.
And notice that this is the indicator of the rejection region. The test is the indicator of the rejection region.
So there's two ways you can make an error when there's a test. Either the truth is h0, and you're saying actually it's h1. Or the truth is h1, and you say it's h0. And that's how we build-in the asymmetry between h0 and h1. We control only one of the two errors. And we hope for the best for the second one.
So the type 1 error is the one that says, well, it is actually the status quo, but I claim that there is a discovery. If it's actually h0, but I claim that I'm in h1, then I commit a type 1 error. And so the probability of type 1 error is this function alpha of psi, which is the probability of saying that psi is equal to 1 when theta is in theta 0.
Now, the problem is that this is not just a number, because theta is just like moving all over theta 0, right? There's many values that theta can take, right? So theta is somewhere here. I erased it, OK.
All right. For simplicity, we're going to think of theta as being mu and 103.5, OK? And so, I know that this is theta 1. And just this point here was theta 0, OK? Agreed? This is with the Cherry Blossom Run.
Now, here in this case, it's actually easy. I need to compute this function alpha of psi, which maps theta in theta 0 to p theta of psi equals 1. So that's the probability that I reject when theta is in h0. Then there's only one of them to compute, because theta can only take this one value. So this is really 103.5. OK. So that's the probability that I reject when the true mean was 103.5.
Now, if I was testing whether-- if h0 was this entire guy here, all the values larger than 103.5, then I would have to compute this function for all possible values of the theta in there. And guess what? The worst case is when it's going to be here. Because it's so close to the alternative that that's where I'm making the most error possible.
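(Back in the coin example, where theta 0 is the single point p = 0.5, the probability of type 1 error can be estimated by simulation. This is a sketch; the helper psi is the same test as before with c = 1.96, with a degenerate-sample guard added.)

```python
import random
from math import sqrt

random.seed(0)

def psi(flips, c=1.96):
    """1 = reject 'p = 0.5', 0 = fail to reject."""
    xbar = sum(flips) / len(flips)
    if xbar in (0.0, 1.0):  # all heads or all tails: reject outright
        return 1
    z = sqrt(len(flips)) * (xbar - 0.5) / sqrt(xbar * (1 - xbar))
    return 1 if abs(z) > c else 0

# alpha(psi) at p = 0.5: how often do we reject a truly fair coin?
trials, n = 20_000, 80
level = sum(psi([random.random() < 0.5 for _ in range(n)])
            for _ in range(trials)) / trials
print(level)  # lands near 0.05 (only approximately, since n is finite)
```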
And then there's the type 2 error, which is defined basically in a symmetric way. It's the function that maps theta to the probability-- so that's the probability of type 2 error-- the probability that I fail to reject h0, right? If psi is equal to 0, I fail to reject h0. But it actually came from h1, OK?
So in this example, it's clear. If I'm here, like if the true mean was 100, I'm looking at the probability that the true mean is actually 100, and I'm saying it's 103.5, or that it's not less than 103.5. Yeah?
STUDENT: I'm just still confused by the notation. When you say that [INAUDIBLE] theta sub 1 arrow r, I'm not sure what that notation means.
PROFESSOR: Well, this just means it's a function that maps theta 0 to r. You've seen functions, right? OK. So that's just the way you write it. So that means that's a function f that goes from, say, r to r, and that maps x to x squared. OK. So here, I'm just saying I don't have to consider all possible values. I'm only considering the values in theta 0. I put r, actually. I could restrict myself to the interval 0, 1, because those are probabilities.
So it's just telling me where my function comes from and where my function goes to. And beta is a function, right? So beta psi of theta is just the probability under theta that psi is equal to 0. And I could define that for all thetas. But the only ones that lead to an error are the thetas that are in theta 1. I mean, I can define this function. It's just not going to correspond to an error, OK?
And the power of a test is the smallest-- so the power is basically 1 minus an error. 1 minus the probability of an error. So it's the probability of making a correct decision, OK? It's the probability of making a correct decision under h1, that's what the power is. But again, this could be a function, because there's many ways theta can be in h1 if h1 is an entire set of numbers. For example, all the numbers that are less than 103.5.
And so, what I'm doing here when I define the power of a test, I'm looking at the smallest possible of those values, OK? So I'm looking at this function. Maybe I should actually expand a little more on this. OK.
So beta psi of theta is the probability under theta that psi is equal to 0, right? That's the probability, under theta in theta 1, which means under the alternative, that I fail to reject. And I really should have, because theta was actually in theta 1, OK?
So this thing here is the probability of type 2 error. Now, this is 1 minus the probability that I did reject, and I should have rejected. That's just the complement. Because if psi is not equal to 0, then it's equal to 1. So now if I rearrange this, it tells me that the probability that psi is equal to 1 is actually 1 minus beta psi of theta. So that's true for all thetas in theta 1.
And what I'm saying is, well, this is now a good thing, right? This number being large is a good thing. It means I should have rejected, and I rejected. I want this to happen with large probability. And so, what I'm going to look at is the most conservative choice of this number, right? Rather than being super optimistic and saying, oh, but indeed if theta was actually equal to zero-- I mean, if mu is equal to 0, everybody runs in 0 seconds, then with high probability I'm actually going to make no mistake-- I should look at the worst possible case, OK? So the smallest value this can take on theta 1 is called the power of psi. The power of the test psi, OK? So that's the smallest possible value it can take.
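(The power at a given theta in theta 1 is just P_theta(psi = 1), which is 1 minus beta psi of theta. At theta = 0.6 in the coin example it can be estimated by simulation; a sketch with the same test as before.)

```python
import random
from math import sqrt

random.seed(1)

def psi(flips, c=1.96):
    """1 = reject 'p = 0.5', 0 = fail to reject."""
    xbar = sum(flips) / len(flips)
    if xbar in (0.0, 1.0):  # degenerate sample: reject outright
        return 1
    z = sqrt(len(flips)) * (xbar - 0.5) / sqrt(xbar * (1 - xbar))
    return 1 if abs(z) > c else 0

# Power at theta = 0.6: probability of correctly rejecting with n = 80 tosses
trials, n = 20_000, 80
power = sum(psi([random.random() < 0.6 for _ in range(n)])
            for _ in range(trials)) / trials
print(power)  # roughly 0.45: a bias of 0.1 is detected less than half the time
```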
All right. So I'm sorry, this is a lot of definitions that you have to sink in. And it's not super pleasant. But that's what testing is. There's a lot of jargon. Those are actually fairly simple things. Just maybe you should get a sheet for yourself, and say, these are the new terms that I learned. What is a test? A rejection region? Probability of type 1 error, probability of type 2 error, and power. Just make sure you know what those guys are. Oh, and null and alternative hypothesis, OK?
And once you know all these things, you know what I'm talking about. You know what I'm referring to. And this is just jargon. But in the end, those are just probabilities. I mean, these are natural quantities. Just for some reason, people have been used to using different terminology.
So just to illustrate. When do I make a type 1 error? And when do I not make a type 1 error? So I make a type 1 error if h0 is true and I reject h0, right? So the off-diagonal blocks are when I make an error. When I'm on the diagonal terms-- h1 is true and I reject h0, that's a correct decision. When h0 is true and I fail to reject h0, that's also the correct decision to make.
So I only make errors when I'm in one of the red blocks. And one block is the type 1 error and the other block is the type 2 error. That's all it means, OK? So you just have to know which one we called one. I mean, this was chosen in a pretty ad hoc way.
So to conclude this lecture, let me ask you a few questions. In a US court, the defendant is found, let's just say for the sake of discussion, innocent or guilty, all right? It's really guilty or not guilty, but let's say innocent or guilty. When does the jury make a type 1 error? Yep? When they say he's guilty? And he's actually innocent, right?
The status quo is everybody is innocent until proven guilty. So our h0 is that the person is innocent. And so, that means that h0 is innocent. And so, we're looking at the probability of type 1 error, so that's when we reject the fact that it's innocent. So we conclude that this person is guilty, OK? So a type 1 error is when this person is innocent and we conclude it's guilty.
What is the type 2 error? Letting a guilty person go free, which actually according to the constitution, is the better of the two. All right? So what we're going to try to do is to control the first one, and hope for the best for the second one.
How could the jury make sure that they make no type 1 error ever? Always let the guy go free, right? What is the effect on the type 2 error? Yeah, it's the worst possible, right? I mean, basically, for every guy that's guilty, you let them go. That's the worst you can do.
And same thing, right? How can the jury make sure that there's no type 2 error? Always convict. What is the effect on the American budget? What is the effect on the type 1 error? Right. So the effect is that basically the type 1 error is maximized. So there's this trade off between type 1 and type 2 error that's inherent. And that's why we have this sort of multi objective thing. We're trying to minimize two things at the same time.
And you can find many ad hoc ways, right? So if you've taken any optimization, trying to optimize two things when one is going up while the other one is going down, the only thing you can do is make ad hoc heuristics. Maybe you try to minimize the sum of those two guys. Maybe you try to minimize 1/3 of the first guy plus 2/3 of the second guy. Maybe you try to minimize the first guy plus the square of the second guy. You can think of many ways, but none of them is more justified than the other.
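(The trade-off can be made concrete with the Gaussian approximation from the coin example. As the threshold c grows, the type 1 error shrinks while the type 2 error at p = 0.6, n = 80 grows; the shift sqrt(n) times 0.6 minus 0.5 over the square root of 0.6 times 0.4 is the one computed earlier in the lecture. A sketch.)

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Gaussian CDF
n = 80
# Approximate mean of the standardized statistic when the truth is p = 0.6
shift = sqrt(n) * (0.6 - 0.5) / sqrt(0.6 * 0.4)

type1, type2 = [], []
for c in (1.0, 1.96, 2.58, 3.0):
    type1.append(2 * (1 - Phi(c)))                  # reject although p = 0.5
    type2.append(Phi(c - shift) - Phi(-c - shift))  # keep h0 although p = 0.6
    print(f"c = {c}: type 1 ~ {type1[-1]:.3f}, type 2 ~ {type2[-1]:.3f}")
```

One error goes down exactly as the other goes up, which is why some extra principle is needed to pick c.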
However, for statistical hypothesis testing, there's one that's very well justified, which is: constrain your type 1 error to be at a level that you deem acceptable. 5%. I want to convict at most 5% of innocent people. That's what I deem reasonable. And based on that, I'm going to try to convict as many guilty people as I can, all right? So that's called the Neyman-Pearson paradigm, and we'll talk about it next time. All right. Thank you.