# Lecture 21: Generalized Linear Models

Flash and JavaScript are required for this feature.

Download the video from iTunes U or the Internet Archive.

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare and ocw.mit.edu.

PHILIPPE RIGOLLET: The chapter is a natural capstone chapter for this entire course. We'll see some of the things we've seen during maximum likelihood and some of the things we've seen during linear regression, some of the things we've seen in terms of the basic modeling that we've had before. We're not going to go back to much inference questions.

It's really going to be about modeling. And in a way, generalized linear models, as the word says, are just a generalization of linear models. And they're actually extremely useful. They're often forgotten about and people just jump onto machine learning and sophisticated techniques. But those things do the job quite well.

So let's see in what sense they are a generalization of the linear models. So remember, the linear model looked like this. We said that y was equal to x transpose beta plus epsilon, right? That was our linear regression model.

And it's-- another way to say this is that if-- and let's assume that those were, say, Gaussian with mean 0 and identity covariance matrix. Then another way to say this is that the conditional distribution of y given x is equal to-- sorry, I a Gaussian with mean x transpose beta and variance-- well, we had a sigma squared, which I will forget as usual-- x transpose beta and then sigma squared. OK, so here, we just assumed that-- so what is regression is just saying I'm trying to explain why as a function of x.

Given x, I'm assuming a distribution for the y. And this x is just going to be here to help me model what the mean of this Gaussian is, right? I mean, I could have something crazy. I could have something that looks like y given x is n0 x transpose beta. And then this could be some other thing which looks like, I don't know, some x transpose gamma squared times, I don't know, x, x transpose plus identity-- some crazy thing that depends on x here, right?

And we deliberately assumed that all the thing that depends on x shows up in the mean, OK? And so what I have here is that y given x is a Gaussian with a mean that depends on x and covariance matrix sigma square identity. Now the linear model assumed a very specific form for the mean. It said I want the mean to be equal to x transpose beta which, remember, was the sum from, say, j equals 1 to p of beta j xj, right?

It's where the xj's are the coordinates of x. But I could do something also more complicated, right? I could have something that looks like instead , replace this by, I don't know, sum of beta j log of x to the j divided by x to the j squared or something like this, right?

I could do this as well. So there's two things that we have assumed. The first one is that when I look at the conditional distribution of y given x, x affects only the mean. I also assume that it was Gaussian and that it affects only the mean. And the mean is affected in a very specific way, which is linear in x, right?

So this is essentially the things we're going to try to relax. So the first thing that we assume, the fact that y was Gaussian and had only its mean [INAUDIBLE] dependant no x is what's called the random component. It just says that the response variables, you know, it sort of makes sense to assume that they're Gaussian.

And everything was essentially captured, right? So there's this property of Gaussians that if you tell me-- if the variance is known, all you need to tell me to understand exactly what the distribution of a Gaussian is, all you need to tell me is its expected value. All right, so that's this mu of x. And the second thing is that we have this link that says, well, I need to find a way to use my x's to explain this mu you and the link was exactly mu of x was equal to x transpose beta.

Now we are talking about generalized linear models. So this part here where mu of x is of the form-- the way I want my beta, my x, to show up is linear, this will never be a question. In principle, I could add a third point, which is just question this part, the fact that mu of x is x transpose beta.

I could have some more complicated, nonlinear function of x. And then we'll never do that because we're talking about generalized linear model. The only thing with generalize are the random component, the conditional distribution of y given x, and the link that just says, well, once you actually tell me that the only thing I need to figure out is the mean, I'm just going to slap it exactly these x transpose beta thing without any transformation of x transpose beta.

So those are the two things. It will become clear what I mean. This sounds like a tautology, but let's just see how we could extend that. So what we're going to do in generalized linear models-- right, so when I talk about GLNs, the first thing I'm going to do with my x is turn it into some x transpose beta. And that's just the l part, right?

I'm not going to be able to change. That's the way it works. I'm not going to do anything non-linear.

But the two things I'm going to change is this random component, which is that y, which used to be some Gaussian with mean mu of x here in sigma squared-- so y given x, sorry-- this is going to become y given x follows some distribution. And I'm not going to allow any distribution. I want something that comes from the exponential family.

Who knows what the exponential family of distribution is? This is not the same thing as the exponential distribution. It's a family of distributions.

All right, so we'll see that. It's-- wow. What can that be? Oh yeah, that's actually [INAUDIBLE]. So-- I'm sorry?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm in presentation mode. That should not happen. OK, so hopefully, this is muted.

So essentially, this is going to be a family of distributions. And what makes them exponential typically is that there's an exponential that shows up in the definition of the density, all right? We'll see that the Gaussian belongs to the exponential family. But they're slightly less expected ones because there's this crazy thing that a to the x is exponential x log a, which makes the potential show up without being there.

So if there's an exponential of some power, it's going to show up. But it's more than that. So we'll actually come to this particular family of distribution.

Why this particular family? Because in a way, everything we've done for the linear model with Gaussian is going to extend fairly naturally to this family. All right, and it actually also, because it encompasses pretty much everything, all the distributions we've discussed before.

All right, so the second thing that I want to question-- right, so before, we just said, well, mu of x was directly equal to this thing. Mu of x was directly x transpose beta. So I knew I was going to have an x transpose beta and I said, well, I could do something with this x transpose beta before I used it to explain the expected value.

But I'm actually taking it like that. Here, we're going to say, let's extend this to some function is equal to this thing. Now admittedly, this is not the most natural way to think about it. What you would probably feel more comfortable doing is write something like mu of x is a function. Let's call it f of x transpose beta.

But here, I decide to call f g inverse. OK, let's just my g inverse. Yes.

AUDIENCE: Is this different then just [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah. I mean, what transformation you want to put on your x's?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh no, certainly not, right? I mean, if I give you-- if I force you to work with x1 plus x2, you cannot work with any function of x1 plus any function of x2, right? So this is different.

All right, so-- yeah. The transformation would be just the simple part of your linear regression problem where you would take your exes, transform them, and then just apply another linear regression. This is genuinely new.

Any other question? All right, so this function g and the reason why I sort of have to, like, stick to this slightly less natural way of defining it is because that's g that gets a name, not g inverse that gets a name. And the name of g is the link function.

So if I want to give you a generalized linear model, I need to give you two ingredients. The first one is the random component, which is the distribution of y given x. And it can be anything in what's called the exponential family of distributions.

So for example, I could say, y given x is Gaussian with mean mu x sigma identity. But I can also tell you y given x is gamma with shared parameter equal to alpha of x, OK? I could do some weird things like this.

And the second thing is I need to give you a link function. And the link function is going to become very clear how you pick a link function. And the only reason that you actually pick a link function is because of compatibility.

This mu of x, I call it mu because mu of x is always the conditional expectation of y given x, always, which means that let's think of y as being a Bernoulli random variable. Where does mu of x live?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: 0, 1, right? That's the expectation of a Bernoulli. It's just the probability that my coin flip gives me 1.

So it's a number between 0 and 1. But this guy right here, if my x's are anything, right-- think of any body measurements plus [INAUDIBLE] linear combinations with arbitrarily large coefficients. This thing can be any real number.

So the link function, what it's effectively going to do is make those two things compatible. It's going to take my number which, for example, is constrained to be between 0 and 1 and map it into the entire real line. If I have mu which is forced to be positive, for example, in an exponential distribution, the mean is positive, right?

That's the, say, don't know, inter-arrival time for Poisson process. This thing is known to be positive for an exponential. I need to map something that's exponential to the entire real line. I need a function that takes something positive and [INAUDIBLE] everywhere.

So we'll see. By the end of this chapter, you will have 100 ways of doing this, but there are some more traditional ones [INAUDIBLE]. So before we go any further, I gave you the example of a Bernoulli random variable. Let's see a few examples that actually fit there. Yes.

AUDIENCE: Will it come up later [INAUDIBLE] already know why do we need the transformer [INAUDIBLE] why don't [INAUDIBLE]

PHILIPPE RIGOLLET: Well actually, this will not come up later. It should be very clear from here because if I actually have a model, I just want it to be plausible, right? I mean, what happens if I suddenly decide that my-- so this is what's going to happen.

You're going to have only data to fit this model. Let's say you actually forget about this thing here. You can always do this, right? You can always say I'm going to pretend my y's just happen to be the realizations of said Gaussians that happen to be 0 or 1 only. You can always, like, stuff that in some linear model, right?

You will have some least squares estimated for beta. And it's going to be fine. For all the points that you see, it will definitely put some number that's actually between 0 and 1.

So this is what your picture is going to look like. You're going to have a bunch of values for x. This is your y. And for different-- so these are the values of x that you will get.

And for a y, you will see either a 0 or a 1, right? Right, that's what your Bernoulli dataset would look like with a one dimensional x. Now if you do least squares on this, you will find this.

And for this guy, this line certainly takes values between 0 and 1. But let's say now you get an x here. You're going to actually start pretending that the probability it spits out one conditionally in x is like 1.2, and that's going to be weird.

Any other questions? All right, so let's start with some examples. Right, I mean, you get so used to them through this course.

So the first one is-- so all these things are taken. So there's a few books on generalizing, your models, generalize [INAUDIBLE] models. And there's tons of applications that you can see.

Those are extremely versatile, and as soon as you want to do modeling to explain some y given x, you sort of need to do that if you want to go beyond linear models. So this was in the disease occurring rate. So you have a disease epidemic and you want to basically model the expected number of new cases given-- at a certain time, OK?

So you have time that progresses for each of your reservation. Each of your reservation is a time stamp-- say, I don't know, 20th day. And your response is the number of new cases.

And you're going to actually put your model directly on mu, right? When I looked at this, everything here was on mu itself, on the expected, right? Mu of x is always the expected-- the conditional expectation of y given x. right?

So all I need to model is this expected value. So this mu I'm going to actually say-- so I look at some parameters, and it says, well, it increases exponentially. So I want to say I have some sort of exponential trend. I can parametrize that in several ways.

And the two parameters I want to slap in is, like, some sort of gamma, which is just the coefficient. And then there's some rate delta that's in the exponential. So if I tell you it's exponential, that's a nice family of functions you might want to think about, OK?

So here, mu of x, if I want to keep the notation, x is gamma exponential delta x, right? Except that here, my x are t1, t2, t3, et cetera. And I want to find what the parameters gamma and delta are because I want to be able to maybe compare different epidemics and see if they have the same parameter or maybe just do some prediction based on the data that I have without-- to extrapolate in the future.

So here, clearly mu of x is not of the form x transpose beta, right? That's not x transpose beta at all. And it's actually not even a function of x transpose data, right?

There's two parameters, gamma and delta, and it's not of the form. So here we have x, which is 1 and x, right? I have two parameters.

So what I do here is that I say, well, first, let me transform mu in such a way that I can hope to see something that's linear. So if I transform mu, I'm going to have log of mu, which is log of this thing, right? So log of mu of x is equal, well, to log of gamma plus log of exponential delta x, which is delta x.

And now this thing is actually linear in x. So I have that this guy is my first beta 1. And so that's beta 1 finds 1.

And this guy is beta 2-- times, sorry that said beta 0-- times 1, and this guy is beta 1 times x. OK, so that looks like a linear model. I just have to change my parameters-- my parameters beta 1 becomes the log of gamma and beta 2 becomes delta itself.

And the reason why we do this is because, well, the way we put those gamma and those delta was just so that we have some parametrization. It just so happens that if we want this to be linear, we need to just change the parametrization itself. This is going to have some effects.

We know that it's going to have some effect in the fissure information. It's going to have a bunch of effect to change those things. But that's what needs to be done to have a generalized linear model.

Now here, the function that I took to turn it into something that's linear is simple. It came directly from some natural thing I would do here, which is taking the log. And so the function g, the link that I take, is called the log link very creatively. And it's just the function that I apply to mu so that I see something that's linear and that looks like this. So now this only tells me how to deal with the link function. But I still have to deal with 0.1.

And this, again, is just some modeling. Given some data, some random data, what distribution do you choose to explain the randomness? And this-- I mean, unless there's no choice, you know, it's just a matter of practice, right? I mean, why would it be Gaussian and not, you know, doubly exponential?

This is-- there's matters of convenience that come into this, and there's just matter of experience that come into this. You know, I remember when you chat with engineers, they have a very good notion of what the distribution should be. They have y bold distributions. You know, they do optics and things like this.

So there's some distributions that just come up but sometimes just have to work. Now here what do we have? The thing we're trying to measure, y-- as we said, so mu is the expectation, the conditional expectation, of y given x.

But y is the number of new cases, right? Well it's a number of. And the first thing you should think of when you think about number of, if it were bounded above, you would think binomial, baby.

But here, it's just a number. So you think Poisson. That's how insurers think. I have a number of, you know, claims per year. This is a Poisson distribution.

And hopefully they can model the conditional distribution of the number of claims given everything that they actually ask you in the surveys that I hear you now fail in 15 minutes. All right, so now you have this Poisson distribution. And that's just the modeling assumption.

There's no particular reason why you should do this except that, you know, that might be a good idea. And the expected value of your Poisson has to be this mu i, OK? At time i.

Any question about this slide? OK, so let's switch to another example. Another example is the so-called pray capture rate.

So here, what you're interested in is the rate capture of preys yi for a given prey. And you have xy, which is your explanation. And this is just the density of pray. So you're trying to explain the rate of captures of preys given the density of the prey, OK?

And so you need to find some sort of relationship between the two. And here again, you talk to experts and what they tell you is that, well, it's going to be increasing, right? I mean, animals like predators are going to just eat more if there's more preys. But at some point, they're just going to level off because they're going to be [INAUDIBLE] full and they're going to stop capturing those prays.

And you're just going to have some phenomenon that looks like this. So here is a curve that sort of makes sense, right? As your capture rate goes from 0 to 1, you're increasing, and then you see you have this like [INAUDIBLE] function that says, you know, at some point it levels up. OK, so here, one way I could-- I mean, there's again many ways I could just model a function that looks like this.

But a simple one that has only two parameters is this one, where mu i is this a function of xi where I have some parameter alpha here and some parameter h here. OK, so there's clearly-- so this function, there's one that essentially tells you-- so this thing starts at 0 for sure. And essentially, alpha tells you how sharp this thing is, and h tells you at which points you end here.

Well, it's not exactly what those values are equal to, but that tells you this. OK, so, you know-- simple, and-- well, no, OK. Sorry, that's actually alpha, which is the maximum capture. The rate and h represent the pre-density at which the capture weight is.

So that's the half time. OK, so there's actual value [INAUDIBLE]. All right, so now I have this function.

It's certainly not a function. There's no-- I don't see it as a function of x. So I need to find something that looks like a function of x, OK?

So then here, there's no log. There's no-- well, I could actually take a log here. But I would have log of x and log of x plus h. So that would be weird.

So what we propose to do here is to look, rather than looking at mu i, we look 1 over mu i. Right, and so since your function was mu i, when you take 1 over mu i, you get h plus xi divided by alpha xi, which is h over alpha times one over xi plus 1 over alpha.

And now if I'm willing to make this transformation of variables and say, actually, I don't-- my x, whether it's the density of prey or the inverse density of prey, it really doesn't matter. I can always make this transformation when the data comes. Then I'm actually just going to think of this as being some linear function beta 0 plus beta 1, which is this guy, times 1 over xi.

And now my new variable becomes 1 over xi. And now it's linear. And the transformation I had to take was this 1 over x, which is called the reciprocal link, OK?

You can probably guess what the exponential link is going to be and things like this, all right? So we'll talk about other links that have slightly less obvious names. Now again, modeling, right?

So this was the random component. This was the easy part. Now I need to just poor in some domain knowledge about how do I think this function, this y, which is which is the rate of capture of praise, I want to understand how this thing is actually changing what is the randomness of the thing around its mean. And you know, something that-- so that comes from this textbook. The standing deviation of capture rate might be approximately proportional to the mean rate. You need to find a distribution that actually has this property. And it turns out that this happens for gamma distributions, right? In gamma distributions, just like say, for Poisson distribution, the-- well, for Poisson, the variance and mean are of the same order.

Here is the standard deviation that's of the same order as the [INAUDIBLE] for gammas. And it's a positive distribution as well. So here is a candidate. Now since we're sort of constrained to work under the exponential family of distributions, then you can just go through your list and just decide which one works best for you.

All right, third example-- so here we have binary response. Here, essentially the binary response variable indicates the presence or absence of postoperative deforming for kyphosis on children. And here, rather than having one covariance which was before, in the first example, was time, in the second example was the density, here there's three ways that you measure on children.

The first one is age of the child and the second one is the number of vertebrae involved in the operation. And the third one is the start of the range, right-- so where it is on the spine. OK, so the response variable here is, you know, did it work or not, right? I mean, that's very simple.

And so here, it's nice because the random component is the easiest one. As I said, any random variable that takes only two outcomes must be a Bernoulli, right? So that's nice there's no modeling going on here.

So you know that y given x is going to be Bernoulli, but of course, all your efforts are going to try to understand what the conditional mean of your Bernoulli, what the conditional probability of being 1 is going to be, OK? And so in particular-- so I'm just-- here, I'm spelling it out before we close those examples. I cannot say that mu of x is x transpose data for exactly this picture that I drew for you here, right?

There's just no way here-- the goal of doing this is certainly to be able to extrapolate for yet unseen children whether this is something that we should be doing. And maybe the range of x is actually going to be slightly out. And so, OK I don't want to see that have a negative probability of outcome or a positive one-- sorry, or one that's lower than one.

So I need to make this transformation. So what I need to do is to transform mu, which is, we know only a number. All we know is a number between 0 and 1. And we need to transform it in such a way that it maps the entire real line or reciprocally to say that-- or inversely, I should say-- that f of x transpose beta should be a number between 0 and 1.

I need to find a function that takes any real number and maps it into 0 and 1. And we'll see that again, but you have an army of functions that do that for you. What are those functions?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Trait?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Oh.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, I want them to be invertible, right?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: I have an army of function. I'm not asking for one soldier in this army. I want the name of this army.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Well, they're not really invertible either, right? So they're actually in [INAUDIBLE] textbook. Because remember, statisticians don't know how to integrate functions, but they know how to turn a function into a Gaussian integral.

So we know it integrates to 1 and things like this. Same thing here-- we don't know how to build functions that are invertible and map the entire real line to 0, 1, but there's all the cumulative distribution functions that do that for us. So I can you any of those guys, and that's what I'm going to be doing, actually.

All right, so just to recap what I just said as we were speaking, so normal linear model is not appropriate for these examples if only because the response variable is not necessarily Gaussian and also because the linear model has to be-- the mean has to be transformed before I can actually apply a linear model for all these plausible nonlinear models that I actually came up with. OK, so the family we're going to go for is the exponential family of distributions.

And we're going to be able to show-- so one of the nice part of this is to actually compute maximum likelihood estimaters for those right? In the linear model, maximum-- like, in the Gauss linear model, maximum likelihood was as nice as it gets, right? This actually was the least squares estimator.

We had a close form. x transpose x inverse x transpose y, and that was it, OK? We had to just take one derivative.

Here, we're going to have a generally concave likelihood. We're not going to be able to actually solve this thing directly in close form unless it's Gaussian, but we will have-- we'll see actually how this is not just a black box optimization of a concave function. We have a lot of properties of this concave function, and we will be able to show some iterative algorithms.

We'll basically see how, when you opened the box of convex optimization, you will actually be able to see how things work and actually implement it using least squares. So each iteration of this iterative algorithm will essentially be a least squares, and that's actually quite [INAUDIBLE]. So, very demonstrative of statisticians being pretty ingenious so that they don't have to call in some statistical software but just can repeatedly call their least squares Oracle within a statistical software.

OK, so what is the exponential family, right? I promised to do the exponential family. Before we go into this, let me just tell you something about exponential families, and what's the only thing to differentiate an exponential family from all possible distributions?

An exponential family has two parameters, right? And those are not really parameters, but there's this theta parameter of my distribution, OK? So it's going to be indexed by some parameter.

Here, I'm only talking about the distribution of, say, some random variable or some random vector, OK? So here in this slide, you see that the parameter theta that indexed those distribution is k dimensional and the space of the x's that I'm looking at-- so that should really be y, right? What I'm going to plug in here is the conditional distribution of y given x and theta is going to depend on x.

But this really is the y. That's their distribution of the response variable. And so this is on q, right? So I'm going to assume that y takes-- q dimensional-- is q dimensional. Clearly soon, q is going to be equal to 1, but I can define those things generally.

OK, so I have this. I have to tell you what this looks like. And let's assume that this is a probability density function. So this, right this notation, the fact that I just put my theta in subscript, is just for me to remember that this is the variable that indicates the random variable, and this is just the parameter. But I could just write it as a function of theta and x, right?

This is just going to be-- right, if you were in calc, in multivariable calc, you would have two parameter of theta and x and you would need to give me a function. Now think of all-- think of x and theta as being one dimensional at this point. Think of all the functions that can be depending on theta and x.

There's many of them. And in particular, there's many ways theta and x can interact. What the exponential family does for you is that it restricts the way these things can actually interact with each other. It's essentially saying the following.

It's saying this is going to be of the form exponential-- so this exponential is really not much because I could put a log next to it. But what I want is that the way theta and x interact has to be of the form theta times x in an exponential, OK?

So that's the simplest-- that's one of the ways you can think of them interacting is you just the product of the two. Now clearly, this is not a very rich family. So what I'm allowing myself is to just slap on some terms that depend only on theta and depend only on x.

So let's just call this thing, I don't know, f of x, g of theta. OK, so here, I've restricted the way theta and x can interact. So I have something that depends only on x, something that depends only on theta. And here, I have this very specific interaction.

And that's all that exponential families are doing for you, OK? So if we go back to this slide, this is much more general, right? if I want to go from theta and x in r to theta and x theta in r-- to theta in r k and x in rq, I cannot take the product of theta and x.

I cannot even take the inner product between theta and x because they're not even of compatible dimensions. But what I can do is to first map my theta into something and map my x into something so that I actually end up having the same dimensions. And then I can take the inner product. That's the natural generalization of this simple product.

OK, so what I have is-- right, so if I want to go from theta to x, when I'm going to first do is I'm going to take theta, eta of theta-- so let's say eta1 of theta to eta k of theta. And then I'm going to actually take x becomes t1 of x all the way to tk of x.

And what I'm going to do is take the inner product-- so let's call this eta and let's call this t. And I'm going to take the inner product of eta and t, which is just the sum from j equal 1 to k of eta j of theta times tj of x. OK, so that's just a way to say I want this simple interaction but in higher dimension. The simplest way I can actually make those things happen is just by taking inner product.

OK, and so now what it's telling me is that the distribution-- so I want the exponential times something that depends only on theta and something that depends only on x. And so what it tells me is that when I'm going to take p of theta x, it's just going to be something which is exponential times the sum from j equal 1 to k of eta j theta tj of x. And then I'm going to have a function that depends only-- so let me read it for now like c of theta and then a function that depends only on x.

Let me call it h of x. And for convenience, there's no particular reason why I do that. I'm taking this function c of theta and I'm just actually pushing it in there.

So I can write c of theta as exponential minus log of 1 over c of theta, right? And now I have exponential times exponential. So I push it in, and this thing actually looks like exponential sum from j equal 1 to k of eta j theta tj of x minus log 1 over c of theta times h of x.

And this thing here, log 1 over c of theta, I call actually b of theta Because c, I called it c. But I can actually directly call this guy b, and I don't actually care about c itself. Now why don't I put back also h of x in there? Because h of x is really here to just-- how to put it-- OK, h of x and b of theta don't play the same role.

B of theta in many ways is a normalizing constant, right? I want this density to integrate to 1. If I did not have this guy, I'm not guaranteed that this thing integrates to 1. But by tweaking this function b of theta or c of theta-- they're equivalent-- I can actually ensure that this thing integrates to 1. So b of theta is just a normalizing constant.

H of x is something that's going to be funny for us. It's going to be something that allows us to be able to treat both discrete and continuous variables within the framework of exponential families. So for those that are familiar with this, this is essentially saying that that h of x is really just a change of measure. When I actually look at the density of p of theta-- this is with respect to some measure-- the fact that I just multiplied by a function of x just means that I'm not looking-- that this guy here without h of theta is not the density with respect to the original measure, but it's the density with respect to the distribution that has h as a density.

That's all I'm saying, right? So I can first transform my x's and then take the density with respect to that. If you don't want to think about densities or measures, you don't have to. This is just the way-- this is just the definition.

Is there any question about this definition? All right, so it looks complicated, but it's actually essentially the simplest way you could think about it. You want to be able to have x and theta interact and you just say, I want the interaction to be of the form exponential x times theta.

And if they're higher dimensions, I'm going to take the exponential of the function of x inner product with a function of theta. All right, so I claimed since the beginning that the Gaussian was such an example. So let's just do it.

So is the Gaussian of the-- is the interaction between theta and x in a Gaussian of the form in the product? And the answer is yes. Actually, whether I know or not what the variance is, OK? So let's start for the case where I actually do not know what the variance is.

So here, I have x is n mu sigma squared. This is all one dimensional. And here, I'm going to assume that my parameter is both mu and sigma square.

OK, so what I need to do is to have some function of mu, some function of stigma square, and take an inner product of some function of x and some other function of x. So I want to show that-- so p theta of x is what? Well, it's one over square root sigma 2 pi exponential minus x minus mu squared over 2 sigma squared, right? So that's just my Gaussian density.

And I want to say that this thing here-- so clearly, the exponential shows up already. I want to show that this is something that looks like, you know, eta 1 of-- sorry, so that was-- yeah, eta 1 of, say, mu sigma squared. So I have only two of those guys, so I'm going to need only two etas, right?

So I want it to be eta 1 of mu and sigma times t1 of x plus eta 2 mu 1 mu sigma squared times t2 of x, right? So I want to have something like that that shows up, and the only things that are left, I want them to depend either only on theta or only on x. So to find that out, we just need to expand.

OK, so I'm going to first put everything into my exponential and expand this guy. So the first term here is going to be minus x squared over 2 sigma square. The second term is going to be minus mu squared over two sigma squared. And then the cross term is going to be plus x mu divided by sigma squared.

And then I'm going to put this guy here. So I have a minus log sigma over 2 pi, OK? OK, is this-- so this term here contains an interaction between X and the parameters.

This term here contains an interaction between X and the parameters. So let me try to write them in a way that I want. This guy only depends on the parameters, this guy only depends on the parameter.

So I'm going to rearrange things. And so I claim that this is of the form x squared. Well, let's say-- do-- who's getting the minus? Eta, OK.

So it's x squared times minus 1 over 2 sigma squared plus x times mu over sigma squared, right? So that's this term here. That's this term here.

Now I need to get this guy here, and that's minus. So I'm going to write it like this-- minus, and now I have mu squared over 2 sigma squared plus log sigma square root 2 pi. And now this thing is definitely of the form t of x times-- did I call them the right way or not?

Of course not. OK, so that's going to be t2 of x times eta 2 of x eta 2 of theta. This guy is going to be t1 of x times eta 1 of theta. All right, so just a function of theta times a function of x-- just a function of theta times a function of x.

And the way combined is just by sending them. And this is going to be my d of theta. What is h of x?

AUDIENCE: 1.

PHILIPPE RIGOLLET: 1. There's one thing I can actually play with, and this is something you're going to have some three choices, right? This is not actually completely determined here is that-- for example, so when I write the log sigma square root 2 pi, this is just log of sigma plus log square root 2 pi.

So I have two choices here. Either my b becomes this guy, or-- so either I have b of theta, which is mu squared over 2 sigma squared plus log sigma square root 2 pi and h of x is equal to 1, or I have that b of theta is mu square over 2 sigma squared plus log sigma. And h of x is equal to what?

Well, I can just push this guy out, right? I can push it out of the exponential. And so it's just square root of 2 pi, which is a function of x, technically. I mean, it's a constant function of x, but it's a function.

So you can see that it's not completely clear how you're going to do the trade off, right? So the constant terms can go either in b or in h. But you know, why bother with tracking down b and h when you can actually stuff everything into one and just call h one and call it a day?

Right, so you can just forget about h. You know it's one and think about the right. H won't matter actually for estimation purposes or anything like this.

All right, so that's basically everything that's written. When stigma square is known, what's happening is that this guy here is no longer a function of theta, right? Agreed?

This is no longer a parameter. When sigma square is known, then theta is equal to mu only. There's no sigma square going on. So this-- everything depends on sigma square can be thought of as a constant.

Think one. So in particular, this term here does not belong in the interaction between x and theta. It belongs to h, right? So if sigma is known, then this guy is only a function of h-- of x.

So h of x becomes exponential x squared minus x squared over 2 sigma squared, right? That's just a function of x. Is that clear?

So if you complete this computation, what you're going to get is that your new one parameter thing is that p theta x is not equal to exponential x times mu over sigma squared minus-- well, it's still the same thing. And then you have your h of x that comes out-- x squared over 2 sigma squared. OK, so that's my h of x.

That's still my b of theta. And this is my t1 of x. And this is my eta one of theta. And remember, theta is just equal to mu in this case.

So if I ask you prove that this distribution belongs to an exponential family, you just have to work it out. Typically, it's expanding what's in the exponential and see what's-- and just write it in this term and identify all the components, right? So here, notice those guys don't even get an index anymore because there's just one of them. So I wrote eta 1 and t1, but it's really just eta and t.

Oh sorry, this guy also goes. This is also a constant, right? So it can actually just put sigma divided by sigma square root 2 pi. So h of x is what, actually? Is it the density of--

AUDIENCE: Standard [INAUDIBLE].

PHILIPPE RIGOLLET: It's not standard. It's centered. It has mean 0.

But it variance sigma squared, right? But it's the density of a Gaussian. And this is what I meant when I said h of x is really just telling you with respect to which distribution, which measure you're taking the density. And so this thing here is really telling you the density of my Gaussian with mean mu is equal to-- is this with respect to a centered Gaussian is this guy, right?

That's what it means. If this thing ends up being a density, it just means that now you just have a new measure, which is this density. So it's just saying that the density of the Gaussian with mean mu with respect to the Gaussian with mean 0 is just this [INAUDIBLE] here. All right, so let's move on. So here, as I said, you could actually do all these computations and forget about the fact that x is continuous. You can actually do it with PMFs and do it for x is discrete. This actually also tells you if you can actually get the same form for your density, which is of the form exponential times the product of the the interaction between theta and x is just taking this product, then a function only of theta and of function only of x, for the PMF, it also works.

OK, so I claim that the Bernoulli belongs to this family. So the PMF of a Bernoulli-- we say parameter p is p to the x 1 minus p to the 1 minus x, right? Because we know so that's only for x equals 0 or 1.

And the reason is because when x is equal to 0, this is 1 minus p. When x is equal to 1, this is minus 0. OK, we've seen that when we're looking at likelihoods for Bernoullis.

OK, this is not clear this is going to look like this at all. But let's do it. OK, so what does this thing look like?

Well, the first thing I want to do is to make an exponential show up. So what I'm going to write is I'm going to write p to the x as exponential x log p, right? And so I'm going to do that for the other one.

So this thing here-- so I'm going to get exponential x log p plus 1 minus x log 1 minus p. So what I need to do is to collect my terms in x and my terms in whatever parameters I have, see here if theta is equal to p. So if I do this, what I end up having is equal to exponential-- so determine x is log p minus log 1 minus p. So that's x times log p over 1 minus p.

And then the term that rest is just-- that stays is just 1 times log 1 minus p. But I want to see this as a minus something, right? It was minus b of theta.

So I'm going to write it as minus-- well, I can just keep the plus, and I'm going to do-- and that's all [INAUDIBLE]. A-ha!

Well, this is of the form exponential-- something that depends only on x times something that depends only on theta-- minus a function that depends only on theta. And then h of x is equal to 1 again. OK, so let's see.

So I have t1 of x is equal to x. That's this guy. Eta 1 of theta is equal to log p1 minus p.

And b of theta is equal to log 1 over 1 minus p, OK? And h of x is equal to 1, all right? You guys want to do Poisson, or do you want to have any homework?

It's a dilemma because that's an easy homework versus no homework at all but maybe something more difficult. OK, who wants to do it now? Who does not want to raise their hand now? Who wants to raise their hand now?

All right, so let's move on. I'll just do-- do you want to do the gammas instead in the homework? That's going to be fun.

I'm not even going to propose to do the gammas. And so this is the gamma distribution. It's brilliantly called gamma because it has the gamma function just like the beta distribution had the beta function in there.

They look very similar. One is defined over r plus, the positive real line. And remember, the beta was defined over the interval 0, 1.

And it's of the form x to some power times exponential of minus x to some-- times something, right? So there's a function of polynomial [INAUDIBLE] x where the exponent depends on the parameter. And then there's the exponential minus x times something depends on the parameters.

So this is going to also look like some function of x-- sorry, like some exponential distribution. Can somebody guess what is going to be t2 of x? Oh, those are the functions of x that show up in this product, right?

Remember when we have this-- we just need to take some transformations of x so it looks linear in those things and not in x itself. Remember, we had x squared and x, for example, in the Gaussian case. I don't know if it's still there.

Yeah, it's still there, right? t2 was x squared. What do you think x is going-- t2 of x here. So here's a hint. t1 is going to be x.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, [INAUDIBLE], what is going to be t1? Yeah, you can-- this one is taken. This one is taken.

What? Log x, right? Because this x to the a minus 1, I'm going to write that as exponential a minus 1 log x.

So basically, eta 1 is going to be a minus 1. Eta 2 is going to be minus 1 over b-- well, actually the opposite. And then you're going to have-- but this is actually not too complicated.

All right, then those parameters get names. a is the shape parameter, b is the scale parameter. It doesn't really matter. You have other things that are called the inverse gamma distribution, which has this form.

The difference is that the parameter alpha shows negatively there and then the inverse Gaussian distribution. You know, just densities you can come up with and they just happened to fall in this family. And there's other ones that you can actually put in there that we've seen before. The chi-square is actually part of this family.

The beta distribution is part of this family. The binomial distribution is part of this family. Well, that's easy because the Bernoulli was.

The negative binomial, which is some stopping time-- the first time you hit a certain number of successes when you flip some Bernoulli coins. So you can check for all of those, and you will see that you can actually write them as part of the exponential family. So the main goal of this slide is to convince you that this is actually a pretty broad range of distributions because it basically includes everything we've seen but not anything there-- sorry, plus more, OK? Yeah.

AUDIENCE: Is there any example of a distribution that comes up pretty often that's not in the exponential family?

PHILIPPE RIGOLLET: Yeah, like uniform.

AUDIENCE: Oh, OK, so maybe a bit more complicated than [INAUDIBLE].

Anything Anything that has a support that depends on the parameter is not going to fall-- is not going to fit in there. Right, and you can actually convince yourself why anything that has the support that does not-- that depends on the parameter is not going to be part of this guy. It's kind of a hard thing to-- in fact, you proved that it's not and you prove this rule.

That's kind of a little difficult, but the way you can convince yourself is that remember, the only interaction between x and theta that I allowed was taking the product of those guys and then the exponential, right? If you have something that depends on some parameter-- let's say you're going to see something that looks like this. Right, for uniform, it looks like this.

Well, this is not of the form exponential x times theta. There's an interaction between x and theta here, but it's actually certainly not of the form x exponential x times theta. So this is definitely not going to be part of the exponential family.

And every time you start doing things like that, it's just not going to happen. Actually, to be fair, I'm not even sure that all these guys, when you allow them to have all their parameters free, are actually going to be part of this. For example-- the beta probably is, but I'm not actually entirely convinced. There's books on experiential families.

All right, so let's go back. So here, we've put a lot of effort understanding how big, how much wider than the Gaussian distribution can we think of for the conditional distribution of our response y given x. So let's go back to the generalized linear models, right?

So [INAUDIBLE] said, OK, the random component? y has to be part of some exponential family distribution-- check. We know what this means. So now I have to understand two things.

I have to understand what is the expectation, right? Because that's actually what I model, right? I take the expectation, the conditional expectation, of y given x.

So I need to understand given this guy, it would be nice if you had some simple rules that would tell me exactly what the expectation is rather than having to do it over and over again, right? If I told you, here's a Gaussian, compute the expectation, every time you had to use that would be slightly painful. So hopefully, this thing being simple enough-- we've actually selected a class that's simple enough so that we can have rules.

Whereas as soon as they give you those parameters t1, t2, eta 1, eta 2, b and h, you can actually have some simple rules to compute the mean and variance and all those things. And so in particular, I'm interested in the mean, and I'm going to have to actually say, well, you know, this mean has to be mapped into the whole real line. So I can actually talk about modeling this function of the mean as x transpose beta.

And we saw that for the [INAUDIBLE] dataset or whatever other data sets. You actually can-- you can actually do this using the log of the reciprocal or for the-- oh, actually, we didn't do it for the Bernoulli. We'll come to this. This is the most important one, and that's called a logit it or a logistic link.

But before we go there, this was actually a very broad family, right? When I wrote this thing on the bottom board-- it's gone now, but when I wrote it in the first place, the only thing that I wrote is I wanted x times theta. Wouldn't it be nice if you have some distribution that was just x times theta, not some function of x times some function of theta?

The functions seem to be here so that they actually make things a little-- so the functions were here so that I can actually put a lot of functions there. But first of all, if I actually decide to re-parametrize my problem, I can always assume-- if I'm one dimensional, I can always assume that eta 1 of theta becomes my new theta, right? So this thing-- here for example, I could say, well, this is actually the parameter of my Bernoulli.

Let me call this guy theta, right? I could do that. Then I could say, well, here I have x that shows up here. And here since I'm talking about the response, I cannot really make any transformations.

So here, I'm going to actually talk about a specific family for which this guy is not x square or square root of x or log of x or anything I want. I'm just going to actually look at distributions for which this is x. This exponential families are called a canonical exponential family.

So in the canonical exponential family, what I have is that I have my x times theta. I'm going to allow myself some normalization factor phi, and we'll see, for example, that it's very convenient when I talk about the Gaussian, right? Because even if I know-- yeah, even if I know this guy, which I actually pull into my-- oh, that's over here, right?

Right, I know sigma squared. But I don't want to change my parameter to be mu over sigma squared. It's kind of painful. So I just take mu, and I'm going to keep this guy as being this phi over there.

And it's called the dispersion parameter from a clear analogy with the Gaussian, right? That's the variance and that's measuring dispersion. OK, so here, what I want is I'm going to think throughout this class-- so phi may be known or not.

And depending-- when it's not known, this actually might turn into some exponential family or it might not. And the main reason is because this b of theta over phi is not necessarily a function of theta over phi, right? If I actually have phi unknown, then y theta over phi has to be-- this guy has to be my new parameter. And b might not be a function of this new parameter.

OK, so in a way, it may or may not, but this is not really a concern that we're going to have because throughout this class, we're going to assume that phi is known, OK? Phi is going to be known all the time, which means that this is always an exponential family. And it's just the simplest one you could think of-- one dimensional parameter, one dimensional response, and I just have-- the product is just y times or, we used to call it x.

Now I've switched to y, but y times theta divided by phi, OK? Should I write this or this is clear to everyone what this is? Let me write it somewhere so we actually keep track of it toward the [INAUDIBLE].

OK, so this is-- remember, we had all the distributions. And then here we had the exponential family. And now we have the canonical exponential family.

It's actually much, much smaller. Well, actually, it's probably sort of a good picture. And what I have is that my density or my PMF is just exponential y times theta minus b of theta divided by phi.

And I have plus phi of-- oh, yeah, plus phi of y phi, which means that this is really-- if phi is known, h of y is just exponential c of y phi, agreed? Actually, this is the reason why it's not necessarily a canonical family.

It might not be that this depends only on y. It could depend on y and phi in some annoying way and I may not be able to break it. OK, but if phi is known, this is just a function that depends on y, agreed? In particular, I think you need-- I hope you can convince yourself that this is just a subcase of everything we've seen before.

So for example, the Gaussian when the variance is known is indeed of this form, right? So we still have it on the board. So here is my y, right? So then let me write this as f theta of y. So every x is replaceable with y, blah, blah, blah.

This is this guy. And now what I have is that this is going to be my phi. This is my parameter of theta. So I'm definitely of the form y times theta divided by phi.

And then here I have a function b that depends only on theta over phi again. So b of theta is mu squared divided by 2. OK, then it's divided by 6 sigma square. And then I have this extra stuff.

But I really don't care what it is for now. It's just something that depends only on y and known stuff. So it was just a function of y just like my h. I stuff everything in there.

The b, though, this thing here, this is actually what's important because in the canonical family, if you think about it, when you know phi-- sorry-- right, this is just y times theta scaled by a known constant-- sorry, y times theta scaled by a known constant is the first term. The second term is b of theta scaled by some known constant.

But b of theta is what's going to make the difference between the Gaussian and Bernoullis and gammas and betas-- this is all in this b of theta. b of theta contains everything that's idiosyncratic to this particular distribution. And so this is going to be important. And we will see that b of theta is going to capture information about the mean, about the variance, about likelihood, about everything.

Should I go through this computation? I mean, it's the same. We've just done it, right? So maybe it's probably better if you can redo it on your own.

All right, so the canonical exponential family also has other distributions, right? So there's the Gaussian and there's the Poisson and there's the Bernoulli. But the other ones may not be part of this, right?

In particular, think about the gamma distribution. We had this-- log x was one of the things that showed up. I mean, I cannot get rid of this log x.

I mean, that's part of it except if a is equal to 1 and I know it for sure, right? So if a is equal to 1, then I'm going to have a minus 1, which is equal to 0. So I'm going to have a minus 1 times log x, which is going to be just 0. So log x is going to vanish from here.

But if a is equal to 1, then this distribution is actually much nicer, and it actually does not even deserve the name gamma. What is it if a is equal to 1? It's an exponential, right?

Gamma 1 is equal to 1. x to the a minus 1 is equal to 1. b-- so I have exponential x over b divided by b. So 1 over b-- call it lambda. And this is just an exponential distribution.

And so every time you're going to see something-- so all these guys that don't make it to this table, they could be part of those guys, but they're just more-- they're just to-- they just have another name in this thing. All right, so you could compute the value of theta for different values, right? So again, you still have some continuous or discrete ones.

This is my b of theta. And I said this is actually really what captures my theta. This b is actually called cumulant generating function, OK?

I don't have time. I could write five slides to explain to you, but it would just only tell you why it's called cumulant generating function. It's also known as the log of the moment generating function. And the way it's called cumulant generating function is because if I start taking successive derivatives and evaluating them at 0, I get the successive cumulance of this distribution, which are some transformation of the moments.

AUDIENCE: What are you talking about again?

PHILIPPE RIGOLLET: The function b.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So this is just normalization. So this is just to tell you I can compute this, but I really don't care. And obviously I don't care about stuff that's complicated.

This is actually cute, and this is what completes everything. And the rest is just like some general description. You only need to tell you that the range of y is 0 to infinity, right?

And that is essentially telling me this is going to give me some hints as to which link function I should be using, right? Because the range of y tells me what the range of expectation of y is going to be. All right, so here, it tells me that the range of y is between 0 and 1. OK, so what I want to show you is that this captures a variety of different ranges that you can have.

OK, so I'm going to want to go into the likelihood. And the likelihood I'm actually going to use to compute the expectations. But since I actually don't have time to do this now, let's just go quickly through this and give you spoiler alert to make sure that you all wake up on Thursday and really, really want to think about coming here immediately.

All right, so the thing I'm going to want to do, as I said, is it would be nice if, at least for this canonical family, when I give you b, you would be able to say, oh, here is a simple computation of b that would actually give me the mean and the variance. The mean and the variance are also known as moments. b is called cumulant generating function.

So it sounds like moments being related to cumulance, I might have a path to finding those, right? And it might involve taking derivatives of b, as we'll see. The way we're going to prove this by using this thing that we've used several times.

So this property we use when we're computing, remember, the fisher information, right? We had two formulas for the fisher information. One was the expectation of the second derivative of the log likelihood, and one was negative expectation of the square-- sorry, expectation of the square, and the other one was negative the expectation of the second derivative, right?

The log likelihood is concave, so this number is negative, this number is positive. And the way we did this is by just permuting some derivative and integral here. And there was just-- we used the fact that something that looked like this, right? The log likelihood is log of f theta.

And when I take the derivative of this guy with respect to theta, then I have something that looks like the derivative divided by f theta. And if I start taking the integral against f theta of this thing, so the expectation of this thing, those things would cancel. And then I had just the integral of a derivative, which I would make a leap of faith and say that it's actually the derivative of the integral.

But this was equal to 1. So this derivative was actually equal to 0. And so that's how you got that the expectation of the derivative of the log likelihood is equal to 0.

And you do it once again and you get this guy. It's just some nice things that happen with the [INAUDIBLE] taking derivative of the log. We've done that, we'll do that again.

But once you do this, you can actually apply it. And-- missing a parenthesis over there. So when you write the log likelihood, it's just log of an exponential. Huh, that's actually pretty nice.

Just like the least squares came naturally, the least squares [INAUDIBLE] came naturally when we took the log likelihood of the Gaussians, we're going to have the same thing that happens when I take the log of the density. The exponential is going to go away, and then I'm going to use this formula. But this formula is going to actually give me an equation directly-- oh, that's where it was.

So that's the one that's missing up there. And so the expectation minus this thing is going to be equal to 0, which tells me that the expectation is just the derivative. Right, so it's still a function of theta, but it's just a derivative of b.

And the variance is just going to be the second derivative of b. But remember, this was some sort of a scaling, right? It's called the dispersion parameter.

So if I had a Gaussian and the variance of the Gaussian did not depend on the sigma squared which I stuffed in this phi, that would be certainly weird. And it cannot depend only on mu, and so this will-- for the Gaussian, this is definitely going to be equal to 1. And this is just going to be equal to my variance.

So this is just by taking the second derivative. So basically, the take-home message is that this function b captures-- by taking one derivative of the expectation and by taking two derivatives captures the variance. Another thing that's actually cool and we'll come back to this and I want to think about is if this second derivative is the variance, what can I say about this thing? What do I know about a variance?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, that's positive. So I know that this is positive. So what does that tell me?

Positive? That's convex, right? A function that has positive second derivative is convex.

So we're going to use that as well, all right? So yeah, I'll see you on Thursday. I have your homework.