WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.850
Commons license.
00:00:03.850 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:19.914
at ocw.mit.edu.
00:00:19.914 --> 00:00:22.080
PHILIPPE RIGOLLET: The
chapter is a natural capstone
00:00:22.080 --> 00:00:24.840
chapter for this entire course.
00:00:24.840 --> 00:00:26.760
We'll see some of
the things we've
00:00:26.760 --> 00:00:29.910
seen during maximum likelihood
and some of the things
00:00:29.910 --> 00:00:34.080
we've seen during linear
regression, some of the things
00:00:34.080 --> 00:00:37.020
we've seen in terms of the basic
modeling that we've had before.
00:00:37.020 --> 00:00:39.655
We're not going to go back
to much inference questions.
00:00:39.655 --> 00:00:41.280
It's really going to
be about modeling.
00:00:41.280 --> 00:00:44.355
And in a way, generalized
linear models, as the word says,
00:00:44.355 --> 00:00:47.010
are just a generalization
of linear models.
00:00:47.010 --> 00:00:49.300
And they're actually
extremely useful.
00:00:49.300 --> 00:00:51.900
They're often forgotten
about and people just
00:00:51.900 --> 00:00:54.720
jump onto machine learning
and sophisticated techniques.
00:00:54.720 --> 00:00:57.384
But those things do
the job quite well.
00:00:57.384 --> 00:00:59.550
So let's see in what sense
they are a generalization
00:00:59.550 --> 00:01:02.250
of the linear models.
00:01:02.250 --> 00:01:05.400
So remember, the linear
model looked like this.
00:01:05.400 --> 00:01:13.030
We said that y was equal to x
transpose beta plus epsilon,
00:01:13.030 --> 00:01:13.530
right?
00:01:13.530 --> 00:01:15.960
That was our linear
regression model.
00:01:15.960 --> 00:01:19.330
And it's-- another way
to say this is that if--
00:01:19.330 --> 00:01:20.970
and let's assume
that those were, say,
00:01:20.970 --> 00:01:25.230
Gaussian with mean 0 and
identity covariance matrix.
00:01:25.230 --> 00:01:26.730
Then another way
to say this is that
00:01:26.730 --> 00:01:32.700
the conditional distribution
of y given x is equal to--
00:01:32.700 --> 00:01:39.690
sorry, is a Gaussian with mean
x transpose beta and variance--
00:01:39.690 --> 00:01:43.440
well, we had a sigma squared,
which I will forget as usual--
00:01:43.440 --> 00:01:46.080
x transpose beta and
then sigma squared.
00:01:46.080 --> 00:01:50.550
OK, so here, we just assumed
that-- so what is regression
00:01:50.550 --> 00:01:54.630
is just saying I'm trying to
explain y as a function of x.
00:01:54.630 --> 00:01:57.660
Given x, I'm assuming a
distribution for the y.
00:01:57.660 --> 00:01:59.460
And this x is just
going to be here
00:01:59.460 --> 00:02:05.430
to help me model what the mean
of this Gaussian is, right?
00:02:05.430 --> 00:02:07.720
I mean, I could have
something crazy.
00:02:07.720 --> 00:02:13.570
I could have something
that looks like y given
00:02:13.570 --> 00:02:17.560
x is N of x transpose beta.
00:02:17.560 --> 00:02:19.660
And then this could
be some other thing
00:02:19.660 --> 00:02:22.780
which looks like, I don't
know, some x transpose
00:02:22.780 --> 00:02:26.950
gamma squared
times, I don't know,
00:02:26.950 --> 00:02:30.350
x, x transpose plus identity--
00:02:30.350 --> 00:02:33.250
some crazy thing that
depends on x here, right?
00:02:33.250 --> 00:02:37.570
And we deliberately assumed that
all the thing that depends on x
00:02:37.570 --> 00:02:39.820
shows up in the mean, OK?
00:02:39.820 --> 00:02:42.520
And so what I have
here is that y
00:02:42.520 --> 00:02:45.640
given x is a Gaussian
with a mean that
00:02:45.640 --> 00:02:51.240
depends on x and covariance
matrix sigma squared identity.
00:02:51.240 --> 00:02:54.699
Now the linear model
assumed a very specific form
00:02:54.699 --> 00:02:55.240
for the mean.
00:02:55.240 --> 00:02:59.190
It said I want the
mean to be equal to x
00:02:59.190 --> 00:03:01.050
transpose beta
which, remember, was
00:03:01.050 --> 00:03:10.270
the sum from, say, j equals
1 to p of beta j xj, right?
00:03:10.270 --> 00:03:13.240
It's where the xj's are
the coordinates of x.
00:03:13.240 --> 00:03:16.050
But I could do something
also more complicated, right?
00:03:16.050 --> 00:03:19.170
I could have something
that looks like, instead,
00:03:19.170 --> 00:03:28.990
replace this by, I don't know,
sum of beta j log of x to the j
00:03:28.990 --> 00:03:34.450
divided by x to the j squared
or something like this, right?
00:03:34.450 --> 00:03:37.360
I could do this as well.
00:03:37.360 --> 00:03:39.630
So there's two things
that we have assumed.
00:03:39.630 --> 00:03:41.520
The first one is
that when I look
00:03:41.520 --> 00:03:43.440
at the conditional
distribution of y given x,
00:03:43.440 --> 00:03:45.570
x affects only the mean.
00:03:45.570 --> 00:03:47.394
I also assume that
it was Gaussian
00:03:47.394 --> 00:03:48.810
and that it affects
only the mean.
00:03:48.810 --> 00:03:51.130
And the mean is affected
in a very specific way,
00:03:51.130 --> 00:03:53.670
which is linear in x, right?
00:03:53.670 --> 00:03:56.270
So this is
essentially the things
00:03:56.270 --> 00:03:58.230
we're going to try to relax.
00:03:58.230 --> 00:03:59.670
So the first thing
that we assume,
00:03:59.670 --> 00:04:03.300
the fact that y was Gaussian and
had only its mean [INAUDIBLE]
00:04:03.300 --> 00:04:07.140
dependent on x is what's
called the random component.
00:04:07.140 --> 00:04:09.435
It just says that the
response variables, you know,
00:04:09.435 --> 00:04:13.990
it sort of makes sense to
assume that they're Gaussian.
00:04:13.990 --> 00:04:17.220
And everything was
essentially captured, right?
00:04:17.220 --> 00:04:18.779
So there's this
property of Gaussians
00:04:18.779 --> 00:04:22.069
that if you tell me-- if
the variance is known,
00:04:22.069 --> 00:04:23.610
all you need to tell
me to understand
00:04:23.610 --> 00:04:25.950
exactly what the distribution
of a Gaussian is,
00:04:25.950 --> 00:04:29.110
all you need to tell me
is its expected value.
00:04:29.110 --> 00:04:31.730
All right, so
that's this mu of x.
00:04:31.730 --> 00:04:35.570
And the second thing is that
we have this link that says,
00:04:35.570 --> 00:04:38.600
well, I need to find a way
to use my x's to explain
00:04:38.600 --> 00:04:40.370
this mu. And the
link was exactly
00:04:40.370 --> 00:04:42.390
mu of x was equal
to x transpose beta.
00:04:45.770 --> 00:04:51.140
Now we are talking about
generalized linear models.
00:04:51.140 --> 00:04:56.150
So this part here where mu
of x is of the form-- the way
00:04:56.150 --> 00:05:00.620
I want my beta, my x,
to show up is linear,
00:05:00.620 --> 00:05:03.380
this will never be a question.
00:05:03.380 --> 00:05:06.030
In principle, I could
add a third point,
00:05:06.030 --> 00:05:10.250
which is just question this
part, the fact that mu of x
00:05:10.250 --> 00:05:11.310
is x transpose beta.
00:05:11.310 --> 00:05:13.640
I could have some more
complicated, nonlinear function
00:05:13.640 --> 00:05:14.300
of x.
00:05:14.300 --> 00:05:15.740
And then we'll never do
that because we're talking
00:05:15.740 --> 00:05:17.100
about generalized linear model.
00:05:17.100 --> 00:05:20.640
The only things we generalize
are the random component,
00:05:20.640 --> 00:05:23.330
the conditional
distribution of y given x,
00:05:23.330 --> 00:05:26.870
and the link that just says,
well, once you actually tell me
00:05:26.870 --> 00:05:29.540
that the only thing I need
to figure out is the mean,
00:05:29.540 --> 00:05:32.180
I'm just going to slap in
exactly this x transpose beta
00:05:32.180 --> 00:05:36.520
thing without any transformation
of x transpose beta.
00:05:36.520 --> 00:05:37.750
So those are the two things.
00:05:40.300 --> 00:05:42.260
It will become
clear what I mean.
00:05:42.260 --> 00:05:44.450
This sounds like a
tautology, but let's just
00:05:44.450 --> 00:05:46.730
see how we could extend that.
00:05:46.730 --> 00:05:50.140
So what we're going to do in
generalized linear models--
00:05:50.140 --> 00:05:55.482
right, so when I
talk about GLMs,
00:05:55.482 --> 00:05:57.190
the first thing I'm
going to do with my x
00:05:57.190 --> 00:05:59.330
is turn it into some
x transpose beta.
00:05:59.330 --> 00:06:02.372
And that's just
the L part, right?
00:06:02.372 --> 00:06:03.830
I'm not going to
be able to change.
00:06:03.830 --> 00:06:05.030
That's the way it works.
00:06:05.030 --> 00:06:07.530
I'm not going to do
anything non-linear.
00:06:07.530 --> 00:06:09.780
But the two things
I'm going to change
00:06:09.780 --> 00:06:16.430
is this random
component, which is
00:06:16.430 --> 00:06:21.410
that y, which used to be some
Gaussian with mean mu of x
00:06:21.410 --> 00:06:24.200
here and sigma squared--
00:06:24.200 --> 00:06:26.990
so y given x, sorry--
00:06:26.990 --> 00:06:35.770
this is going to become y given
x follows some distribution.
00:06:35.770 --> 00:06:37.690
And I'm not going to
allow any distribution.
00:06:37.690 --> 00:06:40.900
I want something that comes
from the exponential family.
00:06:49.910 --> 00:06:52.400
Who knows what the exponential
family of distributions is?
00:06:52.400 --> 00:06:55.970
This is not the same thing as
the exponential distribution.
00:06:55.970 --> 00:06:58.970
It's a family of distributions.
00:06:58.970 --> 00:07:00.495
All right, so we'll see that.
00:07:00.495 --> 00:07:01.770
It's-- wow.
00:07:04.560 --> 00:07:06.420
What can that be?
00:07:06.420 --> 00:07:08.194
Oh yeah, that's
actually [INAUDIBLE]..
00:07:11.638 --> 00:07:17.050
So-- I'm sorry?
00:07:17.050 --> 00:07:19.527
AUDIENCE: [INAUDIBLE]
00:07:19.527 --> 00:07:21.360
PHILIPPE RIGOLLET: I'm
in presentation mode.
00:07:21.360 --> 00:07:23.650
That should not happen.
00:07:23.650 --> 00:07:25.130
OK, so hopefully, this is muted.
00:07:29.390 --> 00:07:32.300
So essentially, this is going
to be a family of distributions.
00:07:32.300 --> 00:07:34.442
And what makes them
exponential typically
00:07:34.442 --> 00:07:35.900
is that there's an
exponential that
00:07:35.900 --> 00:07:39.020
shows up in the definition
of the density, all right?
00:07:39.020 --> 00:07:41.000
We'll see that the
Gaussian belongs
00:07:41.000 --> 00:07:42.560
to the exponential family.
00:07:42.560 --> 00:07:44.210
But there are slightly
less expected ones
00:07:44.210 --> 00:07:48.570
because there's this crazy
thing that a to the x
00:07:48.570 --> 00:07:52.327
is exponential x log a, which
makes the exponential show up
00:07:52.327 --> 00:07:53.160
without being there.
00:07:53.160 --> 00:07:54.910
So if there's an
exponential of some power,
00:07:54.910 --> 00:07:55.830
it's going to show up.
00:07:55.830 --> 00:07:56.640
But it's more than that.
00:07:56.640 --> 00:07:58.639
So we'll actually come
to this particular family
00:07:58.639 --> 00:07:59.866
of distributions.
00:07:59.866 --> 00:08:00.990
Why this particular family?
00:08:00.990 --> 00:08:02.406
Because in a way,
everything we've
00:08:02.406 --> 00:08:04.710
done for the linear
model with Gaussian
00:08:04.710 --> 00:08:08.610
is going to extend fairly
naturally to this family.
00:08:08.610 --> 00:08:11.460
All right, and it actually
also, because it encompasses
00:08:11.460 --> 00:08:13.470
pretty much everything,
all the distributions
00:08:13.470 --> 00:08:15.950
we've discussed before.
00:08:15.950 --> 00:08:19.890
All right, so the second thing
that I want to question--
00:08:19.890 --> 00:08:22.260
right, so before,
we just said, well,
00:08:22.260 --> 00:08:28.560
mu of x was directly
equal to this thing.
00:08:31.880 --> 00:08:34.260
Mu of x was directly
x transpose beta.
00:08:34.260 --> 00:08:36.530
So I knew I was going to
have an x transpose beta
00:08:36.530 --> 00:08:39.030
and I said, well, I could do
something with this x transpose
00:08:39.030 --> 00:08:42.750
beta before I used it to
explain the expected value.
00:08:42.750 --> 00:08:44.490
But I'm actually
taking it like that.
00:08:44.490 --> 00:08:52.200
Here, we're going to say, let's
extend this to some function
00:08:52.200 --> 00:08:54.000
is equal to this thing.
00:08:54.000 --> 00:08:56.790
Now admittedly, this is
not the most natural way
00:08:56.790 --> 00:08:57.600
to think about it.
00:08:57.600 --> 00:08:59.724
What you would probably
feel more comfortable doing
00:08:59.724 --> 00:09:03.870
is write something like
mu of x is a function.
00:09:03.870 --> 00:09:08.070
Let's call it f of
x transpose beta.
00:09:08.070 --> 00:09:12.850
But here, I decide
to call f g inverse.
00:09:12.850 --> 00:09:14.574
OK, that's just my g inverse.
00:09:14.574 --> 00:09:15.074
Yes.
00:09:15.074 --> 00:09:18.430
AUDIENCE: Is this different
than just [INAUDIBLE]
00:09:18.430 --> 00:09:19.430
PHILIPPE RIGOLLET: Yeah.
00:09:22.820 --> 00:09:26.855
I mean, what transformation
you want to put on your x's?
00:09:26.855 --> 00:09:35.120
AUDIENCE: [INAUDIBLE]
00:09:35.120 --> 00:09:37.620
PHILIPPE RIGOLLET: Oh
no, certainly not, right?
00:09:37.620 --> 00:09:40.820
I mean, if I give you-- if I
force you to work with x1 plus
00:09:40.820 --> 00:09:44.280
x2, you cannot work with
any function of x1 plus any
00:09:44.280 --> 00:09:46.050
function of x2, right?
00:09:46.050 --> 00:09:48.435
So this is different.
00:09:51.900 --> 00:09:55.230
All right, so-- yeah.
00:09:55.230 --> 00:09:57.900
The transformation would
be just the simple part
00:09:57.900 --> 00:09:59.400
of your linear
regression problem
00:09:59.400 --> 00:10:01.920
where you would take your
x's, transform them,
00:10:01.920 --> 00:10:03.960
and then just apply
another linear regression.
00:10:03.960 --> 00:10:04.950
This is genuinely new.
00:10:07.419 --> 00:10:08.210
Any other question?
00:10:11.040 --> 00:10:13.740
All right, so this
function g and the reason
00:10:13.740 --> 00:10:16.830
why I sort of have to, like,
stick to this slightly less
00:10:16.830 --> 00:10:18.690
natural way of defining
it is because that's
00:10:18.690 --> 00:10:21.660
g that gets a name, not g
inverse that gets a name.
00:10:21.660 --> 00:10:23.330
And the name of g is
the link function.
00:10:29.950 --> 00:10:33.530
So if I want to give you a
generalized linear model,
00:10:33.530 --> 00:10:35.250
I need to give you
two ingredients.
00:10:35.250 --> 00:10:37.630
The first one is the
random component,
00:10:37.630 --> 00:10:40.110
which is the distribution
of y given x.
00:10:40.110 --> 00:10:44.520
And it can be anything in what's
called the exponential family
00:10:44.520 --> 00:10:45.630
of distributions.
00:10:45.630 --> 00:10:47.670
So for example, I
could say, y given
00:10:47.670 --> 00:10:50.910
x is Gaussian with mean
mu of x, sigma squared identity.
00:10:50.910 --> 00:10:53.070
But I can also
tell you y given x
00:10:53.070 --> 00:10:57.580
is gamma with shape parameter
equal to alpha of x, OK?
00:10:57.580 --> 00:11:00.480
I could do some weird
things like this.
00:11:00.480 --> 00:11:03.930
And the second thing is I need
to give you a link function.
00:11:03.930 --> 00:11:08.300
And the link function is
going to become very clear
00:11:08.300 --> 00:11:09.860
how you pick a link function.
00:11:09.860 --> 00:11:12.350
And the only reason that you
actually pick a link function
00:11:12.350 --> 00:11:15.010
is because of compatibility.
00:11:15.010 --> 00:11:18.730
This mu of x, I call
it mu because mu of x
00:11:18.730 --> 00:11:21.950
is always the conditional
expectation of y given x,
00:11:21.950 --> 00:11:25.450
always, which means
that let's think
00:11:25.450 --> 00:11:27.660
of y as being a Bernoulli
random variable.
00:11:31.176 --> 00:11:32.530
Where does mu of x live?
00:11:37.430 --> 00:11:38.410
AUDIENCE: [INAUDIBLE]
00:11:38.410 --> 00:11:39.100
PHILIPPE RIGOLLET: 0, 1, right?
00:11:39.100 --> 00:11:40.683
That's the expectation
of a Bernoulli.
00:11:40.683 --> 00:11:43.630
It's just the probability
that my coin flip gives me 1.
00:11:43.630 --> 00:11:45.630
So it's a number
between 0 and 1.
00:11:45.630 --> 00:11:49.960
But this guy right here, if
my x's are anything, right--
00:11:49.960 --> 00:11:52.540
think of any body
measurements plus [INAUDIBLE]
00:11:52.540 --> 00:11:55.520
linear combinations with
arbitrarily large coefficients.
00:11:55.520 --> 00:11:57.860
This thing can be
any real number.
00:11:57.860 --> 00:12:01.180
So the link function, what
it's effectively going to do
00:12:01.180 --> 00:12:03.060
is make those two
things compatible.
00:12:03.060 --> 00:12:04.570
It's going to take
my number which,
00:12:04.570 --> 00:12:07.270
for example, is constrained
to be between 0 and 1
00:12:07.270 --> 00:12:11.006
and map it into the
entire real line.
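
NOTE
A minimal sketch of the compatibility point above, assuming the logit as one traditional choice of link for a Bernoulli mean. The lecture has not named a specific link at this point, so the function names logit and inv_logit are illustrative only.
```python
import math
# Hypothetical helper names; the lecture only calls g "the link function".
def logit(mu: float) -> float:
    """Link g: maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))
def inv_logit(eta: float) -> float:
    """Inverse link g^{-1}: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))
# x transpose beta can be any real number; g^{-1} turns it into a valid Bernoulli mean.
for eta in (-5.0, 0.0, 5.0):
    mu = inv_logit(eta)
    assert 0.0 < mu < 1.0
    assert abs(logit(mu) - eta) < 1e-9
```
Whatever value x transpose beta takes, the mean obtained through the inverse link stays strictly between 0 and 1.
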
00:12:11.006 --> 00:12:13.380
If I have mu which is forced
to be positive, for example,
00:12:13.380 --> 00:12:16.830
in an exponential distribution,
the mean is positive, right?
00:12:16.830 --> 00:12:20.850
That's the, say, I don't
know, inter-arrival time
00:12:20.850 --> 00:12:22.500
for Poisson process.
00:12:22.500 --> 00:12:25.310
This thing is known to be
positive for an exponential.
00:12:25.310 --> 00:12:27.060
I need to map something
that's exponential
00:12:27.060 --> 00:12:28.072
to the entire real line.
00:12:28.072 --> 00:12:30.030
I need a function that
takes something positive
00:12:30.030 --> 00:12:31.260
and [INAUDIBLE] everywhere.
00:12:31.260 --> 00:12:32.520
So we'll see.
00:12:32.520 --> 00:12:34.020
By the end of this
chapter, you will
00:12:34.020 --> 00:12:36.900
have 100 ways of doing this, but
there are some more traditional
00:12:36.900 --> 00:12:38.560
ones [INAUDIBLE].
00:12:38.560 --> 00:12:41.110
So before we go any further,
I gave you the example
00:12:41.110 --> 00:12:46.809
of a Bernoulli random variable.
00:12:46.809 --> 00:12:48.850
Let's see a few examples
that actually fit there.
00:12:48.850 --> 00:12:49.349
Yes.
00:12:51.104 --> 00:12:53.509
AUDIENCE: Will it come up
later [INAUDIBLE] already know
00:12:53.509 --> 00:12:56.154
why do we need the
transformation [INAUDIBLE] why
00:12:56.154 --> 00:12:59.300
don't [INAUDIBLE]
00:12:59.300 --> 00:13:01.810
PHILIPPE RIGOLLET:
Well actually, this
00:13:01.810 --> 00:13:02.830
will not come up later.
00:13:02.830 --> 00:13:04.510
It should be very
clear from here
00:13:04.510 --> 00:13:06.070
because if I actually
have a model,
00:13:06.070 --> 00:13:08.290
I just want it to
be plausible, right?
00:13:08.290 --> 00:13:11.040
I mean, what happens if I
suddenly decide that my--
00:13:11.040 --> 00:13:12.669
so this is what's
going to happen.
00:13:12.669 --> 00:13:14.710
You're going to have only
data to fit this model.
00:13:14.710 --> 00:13:17.530
Let's say you actually
forget about this thing here.
00:13:17.530 --> 00:13:19.060
You can always do this, right?
00:13:19.060 --> 00:13:23.974
You can always say I'm
going to pretend my y's just
00:13:23.974 --> 00:13:26.140
happen to be the realizations
of some Gaussians that
00:13:26.140 --> 00:13:28.270
happen to be 0 or 1 only.
00:13:28.270 --> 00:13:32.020
You can always, like, stuff that
in some linear model, right?
00:13:32.020 --> 00:13:35.760
You will have some least
squares estimator for beta.
00:13:35.760 --> 00:13:36.880
And it's going to be fine.
00:13:36.880 --> 00:13:38.630
For all the points
that you see, it
00:13:38.630 --> 00:13:40.270
will definitely put
some number that's
00:13:40.270 --> 00:13:42.016
actually between 0 and 1.
00:13:42.016 --> 00:13:44.140
So this is what your picture
is going to look like.
00:13:44.140 --> 00:13:48.795
You're going to have a
bunch of values for x.
00:13:48.795 --> 00:13:50.169
This is your y.
00:13:50.169 --> 00:13:51.960
And for different-- so
these are the values
00:13:51.960 --> 00:13:53.430
of x that you will get.
00:13:53.430 --> 00:13:55.920
And for a y, you will see
either a 0 or a 1, right?
00:13:59.180 --> 00:14:02.990
Right, that's what your
Bernoulli dataset would look
00:14:02.990 --> 00:14:05.210
like with a one dimensional x.
00:14:05.210 --> 00:14:09.680
Now if you do least squares
on this, you will find this.
00:14:09.680 --> 00:14:11.360
And for this guy,
this line certainly
00:14:11.360 --> 00:14:14.290
takes values between 0 and 1.
00:14:14.290 --> 00:14:16.242
But let's say now
you get an x here.
00:14:16.242 --> 00:14:17.950
You're going to actually
start pretending
00:14:17.950 --> 00:14:20.930
that the probability it spits
out 1 conditionally on x
00:14:20.930 --> 00:14:22.910
is like 1.2, and that's
going to be weird.
00:14:28.310 --> 00:14:31.240
Any other questions?
00:14:31.240 --> 00:14:34.700
All right, so let's
start with some examples.
00:14:34.700 --> 00:14:38.790
Right, I mean, you get so used
to them through this course.
00:14:38.790 --> 00:14:41.250
So the first one is--
00:14:41.250 --> 00:14:42.500
so all these things are taken.
00:14:42.500 --> 00:14:44.124
So there's a few
books on generalized
00:14:44.124 --> 00:14:45.950
linear models, generalized
[INAUDIBLE] models.
00:14:45.950 --> 00:14:48.920
And there's tons of
applications that you can see.
00:14:48.920 --> 00:14:50.990
Those are extremely
versatile, and as soon
00:14:50.990 --> 00:14:53.775
as you want to do modeling
to explain some y given x,
00:14:53.775 --> 00:14:55.400
you sort of need to
do that if you want
00:14:55.400 --> 00:14:58.040
to go beyond linear models.
00:14:58.040 --> 00:15:00.610
So this was in the
disease occurring rate.
00:15:00.610 --> 00:15:04.340
So you have a disease
epidemic and you
00:15:04.340 --> 00:15:08.390
want to basically model
the expected number
00:15:08.390 --> 00:15:11.390
of new cases given--
00:15:11.390 --> 00:15:13.100
at a certain time, OK?
00:15:13.100 --> 00:15:16.190
So you have time that progresses
for each of your observations.
00:15:16.190 --> 00:15:18.500
Each of your observations
is a time stamp--
00:15:18.500 --> 00:15:21.410
say, I don't know, 20th day.
00:15:21.410 --> 00:15:26.354
And your response is
the number of new cases.
00:15:26.354 --> 00:15:28.520
And you're going to actually
put your model directly
00:15:28.520 --> 00:15:29.480
on mu, right?
00:15:29.480 --> 00:15:31.970
When I looked at
this, everything here
00:15:31.970 --> 00:15:34.460
was on mu itself, on
the expected, right?
00:15:34.460 --> 00:15:36.410
Mu of x is always the expected--
00:15:39.609 --> 00:15:42.230
the conditional
expectation of y given x.
00:15:45.280 --> 00:15:45.890
right?
00:15:45.890 --> 00:15:51.750
So all I need to model
is this expected value.
00:15:51.750 --> 00:15:54.860
So this mu I'm going
to actually say--
00:15:54.860 --> 00:15:57.620
so I look at some parameters,
and it says, well,
00:15:57.620 --> 00:16:00.489
it increases exponentially.
00:16:00.489 --> 00:16:02.780
So I want to say I have some
sort of exponential trend.
00:16:02.780 --> 00:16:04.820
I can parametrize
that in several ways.
00:16:04.820 --> 00:16:06.500
And the two parameters
I want to slap in
00:16:06.500 --> 00:16:10.190
is, like, some sort of gamma,
which is just the coefficient.
00:16:10.190 --> 00:16:13.920
And then there's some rate
delta that's in the exponential.
00:16:13.920 --> 00:16:15.650
So if I tell you
it's exponential,
00:16:15.650 --> 00:16:17.330
that's a nice family
of functions you
00:16:17.330 --> 00:16:18.710
might want to think about, OK?
00:16:18.710 --> 00:16:24.520
So here, mu of x, if I want
to keep the notation, x
00:16:24.520 --> 00:16:30.650
is gamma exponential
delta x, right?
00:16:30.650 --> 00:16:34.600
Except that here, my x
are t1, t2, t3, et cetera.
00:16:34.600 --> 00:16:37.340
And I want to find what the
parameters gamma and delta are
00:16:37.340 --> 00:16:40.040
because I want to be
able to maybe compare
00:16:40.040 --> 00:16:42.980
different epidemics and see if
they have the same parameter
00:16:42.980 --> 00:16:46.670
or maybe just do some
prediction based on the data
00:16:46.670 --> 00:16:49.070
that I have without-- to
extrapolate in the future.
00:16:52.020 --> 00:16:58.280
So here, clearly mu of
x is not of the form
00:16:58.280 --> 00:17:01.970
x transpose beta, right?
00:17:01.970 --> 00:17:04.410
That's not x
transpose beta at all.
00:17:04.410 --> 00:17:07.579
And it's actually not even a
function of x transpose beta,
00:17:07.579 --> 00:17:08.210
right?
00:17:08.210 --> 00:17:09.900
There's two parameters,
gamma and delta,
00:17:09.900 --> 00:17:11.849
and it's not of the form.
00:17:11.849 --> 00:17:14.359
So here we have x,
which is 1 and x, right?
00:17:14.359 --> 00:17:16.200
I have two parameters.
00:17:16.200 --> 00:17:17.970
So what I do here
is that I say, well,
00:17:17.970 --> 00:17:20.640
first, let me transform
mu in such a way
00:17:20.640 --> 00:17:23.119
that I can hope to see
something that's linear.
00:17:23.119 --> 00:17:26.819
So if I transform mu, I'm
going to have log of mu, which
00:17:26.819 --> 00:17:28.099
is log of this thing, right?
00:17:28.099 --> 00:17:33.770
So log of mu of
x is equal, well,
00:17:33.770 --> 00:17:36.950
to log of gamma plus
log of exponential delta
00:17:36.950 --> 00:17:39.350
x, which is delta x.
00:17:42.050 --> 00:17:46.190
And now this thing is
actually linear in x.
00:17:46.190 --> 00:17:49.440
So I have that this
guy is my first beta 1.
00:17:49.440 --> 00:17:50.990
And so that's beta 1 times 1.
00:17:50.990 --> 00:17:53.320
And this guy is beta 2--
00:17:53.320 --> 00:17:55.950
times, sorry that said beta
0-- times 1, and this guy
00:17:55.950 --> 00:17:58.040
is beta 1 times x.
00:17:58.040 --> 00:18:00.200
OK, so that looks
like a linear model.
00:18:00.200 --> 00:18:02.330
I just have to change
my parameters--
00:18:02.330 --> 00:18:05.840
my parameters beta 1 becomes
the log of gamma and beta 2
00:18:05.840 --> 00:18:08.180
becomes delta itself.
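
NOTE
A minimal numerical check of the reparametrization above: with mu(t) = gamma * exp(delta * t), the log of mu is linear in t with intercept log(gamma) and slope delta. The values of gamma, delta and the time points are hypothetical.
```python
import math
# Hypothetical parameter values, chosen only for illustration.
gamma, delta = 2.0, 0.3
times = [1.0, 2.0, 3.0, 4.0, 5.0]
# Mean number of new cases: mu(t) = gamma * exp(delta * t).
mus = [gamma * math.exp(delta * t) for t in times]
# Apply the log link: log mu(t) = log(gamma) + delta * t, which is linear in t.
log_mus = [math.log(mu) for mu in mus]
# For a linear function, every successive slope is the same number.
slopes = [(b - a) / (t1 - t0)
          for a, b, t0, t1 in zip(log_mus, log_mus[1:], times, times[1:])]
intercept = log_mus[0] - slopes[0] * times[0]
print(slopes)                # each entry equals delta, up to rounding
print(math.exp(intercept))   # recovers gamma, up to rounding
```
On the log scale the two linear coefficients are log(gamma) and delta, which is the change of parameters described in the lecture.
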
00:18:08.180 --> 00:18:11.210
And the reason why we do this
is because, well, the way
00:18:11.210 --> 00:18:13.737
we put those gamma and those
delta was just so that we
00:18:13.737 --> 00:18:14.820
have some parametrization.
00:18:14.820 --> 00:18:17.300
It just so happens that if
we want this to be linear,
00:18:17.300 --> 00:18:20.052
we need to just change the
parametrization itself.
00:18:20.052 --> 00:18:21.510
This is going to
have some effects.
00:18:21.510 --> 00:18:23.301
We know that it's going
to have some effect
00:18:23.301 --> 00:18:24.531
in the Fisher information.
00:18:24.531 --> 00:18:27.030
It's going to have a bunch of
effect to change those things.
00:18:27.030 --> 00:18:29.510
But that's what needs
to be done to have
00:18:29.510 --> 00:18:32.240
a generalized linear model.
00:18:32.240 --> 00:18:35.460
Now here, the
function that I took
00:18:35.460 --> 00:18:37.690
to turn it into something
that's linear is simple.
00:18:37.690 --> 00:18:41.000
It came directly from some
natural thing I would do here,
00:18:41.000 --> 00:18:42.430
which is taking the log.
00:18:42.430 --> 00:18:44.500
And so the function g,
the link that I take,
00:18:44.500 --> 00:18:47.530
is called the log
link very creatively.
00:18:47.530 --> 00:18:49.750
And it's just the
function that I
00:18:49.750 --> 00:18:52.200
apply to mu so that I see
something that's linear
00:18:52.200 --> 00:18:53.260
and that looks like this.
00:18:59.580 --> 00:19:03.890
So now this only tells me how
to deal with the link function.
00:19:03.890 --> 00:19:06.380
But I still have
to deal with point 1.
00:19:06.380 --> 00:19:08.960
And this, again, is
just some modeling.
00:19:08.960 --> 00:19:11.090
Given some data,
some random data,
00:19:11.090 --> 00:19:14.630
what distribution do you choose
to explain the randomness?
00:19:14.630 --> 00:19:17.600
And this-- I mean,
unless there's no choice,
00:19:17.600 --> 00:19:19.820
you know, it's just a
matter of practice, right?
00:19:19.820 --> 00:19:22.100
I mean, why would it be
Gaussian and not, you know,
00:19:22.100 --> 00:19:23.540
doubly exponential?
00:19:23.540 --> 00:19:25.472
This is-- there's matters
of convenience that
00:19:25.472 --> 00:19:27.680
come into this, and there's
just matter of experience
00:19:27.680 --> 00:19:29.780
that come into this.
00:19:29.780 --> 00:19:32.660
You know, I remember when
you chat with engineers,
00:19:32.660 --> 00:19:34.416
they have a very
good notion of what
00:19:34.416 --> 00:19:35.540
the distribution should be.
00:19:35.540 --> 00:19:37.970
They have Weibull distributions.
00:19:37.970 --> 00:19:39.909
You know, they do optics
and things like this.
00:19:39.909 --> 00:19:42.450
So there's some distributions
that just come up but sometimes
00:19:42.450 --> 00:19:43.640
just have to work.
00:19:43.640 --> 00:19:45.380
Now here what do we have?
00:19:45.380 --> 00:19:47.720
The thing we're
trying to measure, y--
00:19:47.720 --> 00:19:49.790
as we said, so mu
is the expectation,
00:19:49.790 --> 00:19:52.070
the conditional
expectation, of y given x.
00:19:52.070 --> 00:19:56.090
But y is the number
of new cases, right?
00:19:56.090 --> 00:19:57.560
Well it's a number of.
00:19:57.560 --> 00:19:59.060
And the first thing
you should think
00:19:59.060 --> 00:20:00.980
of when you think
about number of,
00:20:00.980 --> 00:20:03.620
if it were bounded above, you
would think binomial, baby.
00:20:03.620 --> 00:20:05.220
But here, it's just a number.
00:20:05.220 --> 00:20:06.640
So you think Poisson.
00:20:06.640 --> 00:20:08.750
That's how insurers think.
00:20:08.750 --> 00:20:13.030
I have a number of, you
know, claims per year.
00:20:13.030 --> 00:20:15.570
This is a Poisson distribution.
00:20:15.570 --> 00:20:18.062
And hopefully they can model
the conditional distribution
00:20:18.062 --> 00:20:20.520
of the number of claims given
everything that they actually
00:20:20.520 --> 00:20:24.940
ask you in the
surveys that I hear
00:20:24.940 --> 00:20:26.980
you now fill in 15 minutes.
00:20:26.980 --> 00:20:31.924
All right, so now you have
this Poisson distribution.
00:20:31.924 --> 00:20:33.590
And that's just the
modeling assumption.
00:20:33.590 --> 00:20:34.840
There's no particular
reason why you
00:20:34.840 --> 00:20:36.450
should do this except
that, you know,
00:20:36.450 --> 00:20:38.050
that might be a good idea.
00:20:38.050 --> 00:20:39.700
And the expected
value of your Poisson
00:20:39.700 --> 00:20:42.915
has to be this mu i, OK?
00:20:42.915 --> 00:20:46.330
At time i.
00:20:46.330 --> 00:20:48.760
Any question about this slide?
00:20:48.760 --> 00:20:51.660
OK, so let's switch
to another example.
00:20:51.660 --> 00:20:54.870
Another example is the
so-called prey capture rate.
00:20:54.870 --> 00:20:58.010
So here, what
you're interested in
00:20:58.010 --> 00:21:05.330
is the rate of capture of
preys yi for a given prey.
00:21:05.330 --> 00:21:10.730
And you have xy, which
is your explanation.
00:21:10.730 --> 00:21:12.275
And this is just
the density of prey.
00:21:12.275 --> 00:21:17.030
So you're trying to explain the
rate of captures of preys given
00:21:17.030 --> 00:21:20.060
the density of the prey, OK?
00:21:20.060 --> 00:21:22.964
And so you need to find
some sort of relationship
00:21:22.964 --> 00:21:23.630
between the two.
00:21:23.630 --> 00:21:25.250
And here again,
you talk to experts
00:21:25.250 --> 00:21:27.570
and what they tell you
is that, well, it's
00:21:27.570 --> 00:21:28.820
going to be increasing, right?
00:21:28.820 --> 00:21:32.450
I mean, animals like predators
are going to just eat more
00:21:32.450 --> 00:21:34.239
if there's more preys.
00:21:34.239 --> 00:21:35.780
But at some point,
they're just going
00:21:35.780 --> 00:21:38.450
to level off because they're
going to be [INAUDIBLE] full
00:21:38.450 --> 00:21:42.380
and they're going to stop
capturing those preys.
00:21:42.380 --> 00:21:44.804
And you're just going to
have some phenomenon that
00:21:44.804 --> 00:21:45.470
looks like this.
00:21:45.470 --> 00:21:47.870
So here is a curve that
sort of makes sense, right?
00:21:47.870 --> 00:21:52.130
As your capture rate goes from
0 to 1, you're increasing,
00:21:52.130 --> 00:21:54.530
and then you see you have
this like [INAUDIBLE] function
00:21:54.530 --> 00:21:57.630
that says, you know, at
some point it levels off.
00:21:57.630 --> 00:21:59.490
OK, so here, one way I could--
00:21:59.490 --> 00:22:01.590
I mean, there's again
many ways I could just
00:22:01.590 --> 00:22:03.300
model a function
that looks like this.
00:22:03.300 --> 00:22:05.910
But a simple one that
has only two parameters
00:22:05.910 --> 00:22:09.930
is this one, where mu i is
this function of xi where
00:22:09.930 --> 00:22:13.230
I have some parameter alpha
here and some parameter h here.
00:22:13.230 --> 00:22:15.820
OK, so there's clearly--
00:22:15.820 --> 00:22:21.240
so this function, there's one
that essentially tells you--
00:22:21.240 --> 00:22:23.880
so this thing starts
at 0 for sure.
00:22:23.880 --> 00:22:25.770
And essentially,
alpha tells you how
00:22:25.770 --> 00:22:28.170
sharp this thing
is, and h tells you
00:22:28.170 --> 00:22:30.180
at which points you end here.
00:22:30.180 --> 00:22:32.460
Well, it's not exactly what
those values are equal to,
00:22:32.460 --> 00:22:35.380
but that tells you this.
00:22:35.380 --> 00:22:41.329
OK, so, you know-- simple, and--
00:22:41.329 --> 00:22:41.870
well, no, OK.
00:22:41.870 --> 00:22:44.360
Sorry, that's actually alpha,
which is the maximum capture
00:22:44.360 --> 00:22:46.450
rate, and h represents
the prey density
00:22:46.450 --> 00:22:47.830
at which the capture rate is half the maximum.
00:22:47.830 --> 00:22:49.730
So that's the half time.
00:22:49.730 --> 00:22:52.600
OK, so there's actual
value [INAUDIBLE].
00:22:52.600 --> 00:22:54.500
All right, so now I
have this function.
00:22:54.500 --> 00:22:56.930
It's certainly not a linear function.
00:22:56.930 --> 00:22:59.330
There's no-- I don't see
it as a linear function of x.
00:22:59.330 --> 00:23:06.390
So I need to find something that
looks like a linear function of x, OK?
00:23:06.390 --> 00:23:08.340
So then here, there's no log.
00:23:08.340 --> 00:23:13.570
There's no-- well, I could
actually take a log here.
00:23:13.570 --> 00:23:15.890
But I would have log of
x and log of x plus h.
00:23:15.890 --> 00:23:17.600
So that would be weird.
00:23:17.600 --> 00:23:19.990
So what we propose to
do here is to look,
00:23:19.990 --> 00:23:23.350
rather than looking at mu
i, we look at 1 over mu i.
00:23:23.350 --> 00:23:24.890
Right, and so
since your function
00:23:24.890 --> 00:23:37.450
was mu i, when you
take 1 over mu i,
00:23:37.450 --> 00:23:42.580
you get h plus xi divided
by alpha xi, which
00:23:42.580 --> 00:23:49.690
is h over alpha times one
over xi plus 1 over alpha.
00:23:49.690 --> 00:23:52.320
And now if I'm willing to
make this transformation
00:23:52.320 --> 00:23:54.330
of variables and say,
actually, I don't--
00:23:54.330 --> 00:23:57.900
my x, whether it's
the density of prey
00:23:57.900 --> 00:24:00.759
or the inverse density of
prey, it really doesn't matter.
00:24:00.759 --> 00:24:02.300
I can always make
this transformation
00:24:02.300 --> 00:24:03.750
when the data comes.
00:24:03.750 --> 00:24:06.210
Then I'm actually just
going to think of this
00:24:06.210 --> 00:24:11.400
as being some linear
function beta 0 plus beta 1,
00:24:11.400 --> 00:24:17.345
which is this guy,
times 1 over xi.
00:24:17.345 --> 00:24:20.080
And now my new variable
becomes 1 over xi.
00:24:20.080 --> 00:24:21.260
And now it's linear.
00:24:21.260 --> 00:24:23.350
And the transformation
I had to take
00:24:23.350 --> 00:24:34.240
was this 1 over x, which is
called the reciprocal link, OK?
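A minimal sketch of the reciprocal-link algebra just described, assuming the two-parameter curve mu(x) = alpha * x / (h + x). The function names and the numeric values of alpha and h are made up for illustration; they are not from the lecture.

```python
# Reciprocal-link sketch for the prey capture curve mu(x) = alpha * x / (h + x).
# alpha, h and the x values below are illustrative assumptions only.

def capture_rate(x, alpha, h):
    """Mean capture rate at prey density x (two-parameter curve)."""
    return alpha * x / (h + x)

def reciprocal_link_coefficients(alpha, h):
    """Return (beta0, beta1) such that 1/mu = beta0 + beta1 * (1/x)."""
    return 1.0 / alpha, h / alpha

alpha, h = 2.0, 0.5
beta0, beta1 = reciprocal_link_coefficients(alpha, h)

for x in [0.1, 0.5, 1.0, 5.0, 50.0]:
    mu = capture_rate(x, alpha, h)
    linear_predictor = beta0 + beta1 * (1.0 / x)
    # 1/mu and the linear predictor agree up to floating-point rounding.
    print(f"x={x:6.2f}  mu={mu:.4f}  1/mu={1.0 / mu:.4f}  "
          f"beta0+beta1/x={linear_predictor:.4f}")
```

In this parametrization beta0 = 1/alpha and beta1 = h/alpha, so alpha and h can be recovered from a fitted line as alpha = 1/beta0 and h = beta1/beta0.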
00:24:34.240 --> 00:24:37.120
You can probably guess what the
exponential link is going to be
00:24:37.120 --> 00:24:38.453
and things like this, all right?
00:24:38.453 --> 00:24:41.380
So we'll talk about other
links that have slightly less
00:24:41.380 --> 00:24:43.180
obvious names.
00:24:43.180 --> 00:24:45.206
Now again, modeling, right?
00:24:45.206 --> 00:24:46.580
So this was the
random component.
00:24:46.580 --> 00:24:47.690
This was the easy part.
00:24:47.690 --> 00:24:50.920
Now I need to just pour
in some domain knowledge
00:24:50.920 --> 00:24:55.900
about how do I think this
function, this y, which
00:24:55.900 --> 00:25:01.810
is the rate
of capture of preys,
00:25:01.810 --> 00:25:05.162
I want to understand how
this thing is actually
00:25:05.162 --> 00:25:09.430
changing, what is the randomness
of the thing around its mean.
00:25:09.430 --> 00:25:11.550
And you know, something
that-- so that
00:25:11.550 --> 00:25:12.647
comes from this textbook.
00:25:12.647 --> 00:25:14.230
The standard deviation
of capture rate
00:25:14.230 --> 00:25:16.750
might be approximately
proportional to the mean rate.
00:25:16.750 --> 00:25:18.250
You need to find a
distribution that
00:25:18.250 --> 00:25:19.390
actually has this property.
00:25:19.390 --> 00:25:21.160
And it turns out
that this happens
00:25:21.160 --> 00:25:23.950
for gamma distributions, right?
00:25:23.950 --> 00:25:26.050
In gamma distributions,
just like say,
00:25:26.050 --> 00:25:29.740
for Poisson distribution, the--
00:25:29.740 --> 00:25:32.579
well, for Poisson, the variance
and mean are of the same order.
00:25:32.579 --> 00:25:34.120
Here it's the standard
deviation that's
00:25:34.120 --> 00:25:39.540
of the same order as the
mean for gammas.
00:25:39.540 --> 00:25:42.300
And it's a positive
distribution as well.
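A small sketch of the mean and spread relationships mentioned here, using the standard moment formulas for the gamma (shape/scale parametrization) and Poisson distributions. The fixed shape value and the helper names are assumptions chosen for illustration.

```python
import math

def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a Gamma(shape, scale) distribution."""
    return shape * scale, math.sqrt(shape) * scale

def poisson_mean_sd(rate):
    """Mean and standard deviation of a Poisson(rate) distribution."""
    return rate, math.sqrt(rate)

shape = 4.0  # hypothetical fixed shape; only the scale moves the mean
for scale in [0.5, 1.0, 10.0]:
    mean, sd = gamma_mean_sd(shape, scale)
    # sd / mean = 1 / sqrt(shape): the sd is proportional to the mean.
    print(f"gamma:   mean={mean:7.2f}  sd={sd:6.2f}  sd/mean={sd / mean:.3f}")

for rate in [2.0, 4.0, 40.0]:
    mean, sd = poisson_mean_sd(rate)
    # For Poisson it is the variance, not the sd, that equals the mean.
    print(f"poisson: mean={mean:7.2f}  sd={sd:6.2f}  var/mean={sd ** 2 / mean:.3f}")
```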
00:25:42.300 --> 00:25:43.790
So here is a candidate.
00:25:43.790 --> 00:25:45.260
Now since we're
sort of constrained
00:25:45.260 --> 00:25:48.777
to work under the exponential
family of distributions,
00:25:48.777 --> 00:25:50.360
then you can just
go through your list
00:25:50.360 --> 00:25:52.430
and just decide which
one works best for you.
00:25:55.250 --> 00:25:56.940
All right, third example--
00:25:56.940 --> 00:25:59.270
so here we have binary response.
00:25:59.270 --> 00:26:01.370
Here, essentially the
binary response variable
00:26:01.370 --> 00:26:02.960
indicates the
presence or absence
00:26:02.960 --> 00:26:07.460
of postoperative deformity,
kyphosis, in children.
00:26:07.460 --> 00:26:10.310
And here, rather than having
one covariate, which before,
00:26:10.310 --> 00:26:12.950
in the first example, was
time, in the second example
00:26:12.950 --> 00:26:15.230
was the density, here
there's three variables
00:26:15.230 --> 00:26:17.030
that you measure on children.
00:26:17.030 --> 00:26:19.510
The first one is
age of the child
00:26:19.510 --> 00:26:21.440
and the second one is
the number of vertebrae
00:26:21.440 --> 00:26:23.040
involved in the operation.
00:26:23.040 --> 00:26:25.260
And the third one is
the start of the range,
00:26:25.260 --> 00:26:29.660
right-- so where
it is on the spine.
00:26:29.660 --> 00:26:35.105
OK, so the response
variable here is, you know,
00:26:35.105 --> 00:26:36.440
did it work or not, right?
00:26:36.440 --> 00:26:37.970
I mean, that's very simple.
00:26:37.970 --> 00:26:41.859
And so here, it's nice
because the random component
00:26:41.859 --> 00:26:42.650
is the easiest one.
00:26:42.650 --> 00:26:45.680
As I said, any random variable
that takes only two outcomes
00:26:45.680 --> 00:26:49.020
must be a Bernoulli, right?
00:26:49.020 --> 00:26:52.004
So that's nice-- there's no
modeling going on here.
00:26:52.004 --> 00:26:54.170
So you know that y given x
is going to be Bernoulli,
00:26:54.170 --> 00:26:55.628
but of course, all
your efforts are
00:26:55.628 --> 00:26:58.190
going to try to understand
what the conditional mean
00:26:58.190 --> 00:27:00.315
of your Bernoulli, what
the conditional probability
00:27:00.315 --> 00:27:02.090
of being 1 is going to be, OK?
00:27:02.090 --> 00:27:05.960
And so in particular--
so I'm just-- here,
00:27:05.960 --> 00:27:08.990
I'm spelling it out before
we close those examples.
00:27:08.990 --> 00:27:12.560
I cannot say that mu of x is x
transpose beta for exactly this
00:27:12.560 --> 00:27:15.520
picture that I drew
for you here, right?
00:27:15.520 --> 00:27:17.500
There's just no
way here-- the goal
00:27:17.500 --> 00:27:20.050
of doing this is certainly
to be able to extrapolate
00:27:20.050 --> 00:27:23.650
for yet unseen children
whether this is something
00:27:23.650 --> 00:27:24.850
that we should be doing.
00:27:24.850 --> 00:27:27.340
And maybe the range
of x is actually
00:27:27.340 --> 00:27:28.480
going to be slightly out.
00:27:28.480 --> 00:27:30.550
And so, OK I don't
want to see that have
00:27:30.550 --> 00:27:34.770
a negative probability of
outcome or a positive one--
00:27:34.770 --> 00:27:38.590
sorry, or one that's
larger than one.
00:27:38.590 --> 00:27:40.970
So I need to make
this transformation.
00:27:40.970 --> 00:27:43.700
So what I need to do is
to transform mu, which
00:27:43.700 --> 00:27:44.930
is, we know only a number.
00:27:44.930 --> 00:27:46.880
All we know is a
number between 0 and 1.
00:27:46.880 --> 00:27:48.590
And we need to transform
it in such a way
00:27:48.590 --> 00:27:50.510
that it maps to the
entire real line
00:27:50.510 --> 00:27:57.270
or reciprocally to say that--
00:27:57.270 --> 00:27:58.850
or inversely, I should say--
00:27:58.850 --> 00:28:00.650
that f of x
transpose beta should
00:28:00.650 --> 00:28:02.410
be a number between 0 and 1.
00:28:02.410 --> 00:28:05.000
I need to find a function
that takes any real number
00:28:05.000 --> 00:28:06.950
and maps it into 0 and 1.
00:28:06.950 --> 00:28:10.460
And we'll see that
again, but you
00:28:10.460 --> 00:28:12.480
have an army of functions
that do that for you.
00:28:12.480 --> 00:28:13.761
What are those functions?
00:28:16.707 --> 00:28:17.689
AUDIENCE: [INAUDIBLE]
00:28:17.689 --> 00:28:19.162
PHILIPPE RIGOLLET: I'm sorry?
00:28:19.162 --> 00:28:20.085
AUDIENCE: [INAUDIBLE]
00:28:20.085 --> 00:28:21.126
PHILIPPE RIGOLLET: Trait?
00:28:21.126 --> 00:28:22.665
AUDIENCE: [INAUDIBLE]
00:28:22.665 --> 00:28:23.581
PHILIPPE RIGOLLET: Oh.
00:28:23.581 --> 00:28:25.518
AUDIENCE: [INAUDIBLE]
00:28:25.518 --> 00:28:28.059
PHILIPPE RIGOLLET: Yeah, I want
them to be invertible, right?
00:28:28.059 --> 00:28:34.074
AUDIENCE: [INAUDIBLE]
00:28:34.074 --> 00:28:35.990
PHILIPPE RIGOLLET: I
have an army of functions.
00:28:35.990 --> 00:28:39.100
I'm not asking for one
soldier in this army.
00:28:39.100 --> 00:28:41.860
I want the name of this army.
00:28:41.860 --> 00:28:44.057
AUDIENCE: [INAUDIBLE]
00:28:44.057 --> 00:28:46.640
PHILIPPE RIGOLLET: Well, they're
not really invertible either,
00:28:46.640 --> 00:28:48.860
right?
00:28:48.860 --> 00:28:53.730
So they're actually in
[INAUDIBLE] textbook.
00:28:53.730 --> 00:28:55.272
Because remember,
statisticians don't
00:28:55.272 --> 00:28:56.980
know how to integrate
functions, but they
00:28:56.980 --> 00:28:59.250
know how to turn a function
into a Gaussian integral.
00:28:59.250 --> 00:29:01.722
So we know it integrates
to 1 and things like this.
00:29:01.722 --> 00:29:03.180
Same thing here--
we don't know how
00:29:03.180 --> 00:29:06.692
to build functions that
are invertible and map
00:29:06.692 --> 00:29:08.400
the entire real line
to 0, 1, but there's
00:29:08.400 --> 00:29:11.350
all the cumulative distribution
functions that do that for us.
00:29:11.350 --> 00:29:13.190
So I can use any of
those guys, and that's
00:29:13.190 --> 00:29:16.330
what I'm going to
be doing, actually.
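As an illustration of the point about cumulative distribution functions, here is a sketch with two of them, the standard logistic and the standard normal. Both are increasing maps from the real line into (0, 1), and the logistic one has an explicit inverse. The function names are chosen here for illustration and are not from the lecture.

```python
import math

def logistic_cdf(t):
    """CDF of the standard logistic distribution, written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def logit(p):
    """Inverse of logistic_cdf: maps a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

def normal_cdf(t):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    # Each value of x transpose beta (here t) becomes a valid probability.
    print(f"t={t:5.1f}  logistic={logistic_cdf(t):.4f}  normal={normal_cdf(t):.4f}")
```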
00:29:16.330 --> 00:29:19.730
All right, so just
to recap what I just
00:29:19.730 --> 00:29:23.870
said as we were speaking, so
normal linear model is not
00:29:23.870 --> 00:29:30.470
appropriate for these examples
if only because the response
00:29:30.470 --> 00:29:34.340
variable is not
necessarily Gaussian
00:29:34.340 --> 00:29:37.430
and also because the
linear model has to be--
00:29:37.430 --> 00:29:39.600
the mean has to be transformed
before I can actually
00:29:39.600 --> 00:29:42.210
apply a linear model for all
these plausible nonlinear
00:29:42.210 --> 00:29:44.890
models that I
actually came up with.
00:29:44.890 --> 00:29:48.080
OK, so the family
we're going to go for
00:29:48.080 --> 00:29:50.780
is the exponential
family of distributions.
00:29:50.780 --> 00:29:54.130
And we're going to
be able to show--
00:29:54.130 --> 00:29:56.120
so one of the nice
part of this is
00:29:56.120 --> 00:29:58.300
to actually compute
maximum likelihood
00:29:58.300 --> 00:29:59.570
estimators for those, right?
00:29:59.570 --> 00:30:02.390
In the linear model,
maximum-- like, in the Gaussian
00:30:02.390 --> 00:30:05.360
linear model, maximum likelihood
was as nice as it gets, right?
00:30:05.360 --> 00:30:08.810
This actually was the
least squares estimator.
00:30:08.810 --> 00:30:10.220
We had a closed form.
00:30:10.220 --> 00:30:12.920
x transpose x inverse
x transpose y,
00:30:12.920 --> 00:30:14.120
and that was it, OK?
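For reference, the closed form being read out can be written as follows. The notation is an assumption made here: a design matrix $\mathbb{X}$ whose rows are the $x_i^\top$, a response vector $Y$, and $\mathbb{X}^\top \mathbb{X}$ assumed invertible.

```latex
\hat{\beta}
  = \operatorname*{argmin}_{\beta} \, \lVert Y - \mathbb{X}\beta \rVert_2^2
  = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top Y
```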
00:30:14.120 --> 00:30:15.780
We had to just take
one derivative.
00:30:15.780 --> 00:30:19.580
Here, we're going to have a
generally concave likelihood.
00:30:19.580 --> 00:30:21.170
We're not going to
be able to actually
00:30:21.170 --> 00:30:23.750
solve this thing
directly in closed form
00:30:23.750 --> 00:30:26.610
unless it's Gaussian,
but we will have--
00:30:26.610 --> 00:30:30.070
we'll see actually
how this is not just
00:30:30.070 --> 00:30:32.770
a black box optimization
of a concave function.
00:30:32.770 --> 00:30:35.830
We have a lot of properties
of this concave function,
00:30:35.830 --> 00:30:38.500
and we will be able to show
some iterative algorithms.
00:30:38.500 --> 00:30:42.880
We'll basically see how, when
you open the box of convex
00:30:42.880 --> 00:30:46.270
optimization, you will actually
be able to see how things work
00:30:46.270 --> 00:30:49.070
and actually implement
it using least squares.
00:30:49.070 --> 00:30:51.260
So each iteration of
this iterative algorithm
00:30:51.260 --> 00:30:52.760
will essentially
be a least squares,
00:30:52.760 --> 00:30:54.700
and that's actually
quite [INAUDIBLE].
00:30:54.700 --> 00:30:56.830
So, very demonstrative
of statisticians
00:30:56.830 --> 00:30:59.770
being pretty
ingenious so that they
00:30:59.770 --> 00:31:01.900
don't have to call in
some statistical software
00:31:01.900 --> 00:31:06.040
but just can repeatedly
call their least squares
00:31:06.040 --> 00:31:09.730
oracle within a
statistical software.
00:31:09.730 --> 00:31:12.170
OK, so what is the
exponential family, right?
00:31:12.170 --> 00:31:14.390
I promised to do the
exponential family.
00:31:14.390 --> 00:31:17.540
Before we go into
this, let me just
00:31:17.540 --> 00:31:19.910
tell you something about
exponential families,
00:31:19.910 --> 00:31:22.040
and what's the only
thing to differentiate
00:31:22.040 --> 00:31:25.870
an exponential family from
all possible distributions?
00:31:25.870 --> 00:31:28.640
An exponential family has
two parameters, right?
00:31:28.640 --> 00:31:30.140
And those are not
really parameters,
00:31:30.140 --> 00:31:33.530
but there's this theta parameter
of my distribution, OK?
00:31:33.530 --> 00:31:35.450
So it's going to be
indexed by some parameter.
00:31:35.450 --> 00:31:37.324
Here, I'm only talking
about the distribution
00:31:37.324 --> 00:31:40.550
of, say, some random variable
or some random vector, OK?
00:31:40.550 --> 00:31:44.360
So here in this slide, you see
that the parameter theta that
00:31:44.360 --> 00:31:48.760
indexes those distributions
is k dimensional
00:31:48.760 --> 00:31:53.840
and the space of the x's
that I'm looking at-- so
00:31:53.840 --> 00:31:55.735
that should really be y, right?
00:31:55.735 --> 00:31:57.110
What I'm going to
plug in here is
00:31:57.110 --> 00:31:59.570
the conditional distribution
of y given x and theta is
00:31:59.570 --> 00:32:00.620
going to depend on x.
00:32:00.620 --> 00:32:02.110
But this really is the y.
00:32:02.110 --> 00:32:04.770
That's the distribution
of the response variable.
00:32:04.770 --> 00:32:06.620
And so this is on q, right?
00:32:06.620 --> 00:32:09.250
So I'm going to
assume that y takes--
00:32:09.250 --> 00:32:12.200
q dimensional--
is q dimensional.
00:32:12.200 --> 00:32:14.270
Clearly soon, q is
going to be equal to 1,
00:32:14.270 --> 00:32:16.340
but I can define those
things generally.
00:32:16.340 --> 00:32:17.750
OK, so I have this.
00:32:17.750 --> 00:32:19.710
I have to tell you
what this looks like.
00:32:19.710 --> 00:32:23.310
And let's assume that this is
a probability density function.
00:32:23.310 --> 00:32:26.360
So this, right, this notation,
the fact that I just
00:32:26.360 --> 00:32:28.490
put my theta in
subscript, is just
00:32:28.490 --> 00:32:31.400
for me to remember that
this is the variable that
00:32:31.400 --> 00:32:34.160
indicates the random variable,
and this is just the parameter.
00:32:34.160 --> 00:32:37.400
But I could just write it as a
function of theta and x, right?
00:32:37.400 --> 00:32:39.650
This is just going to be--
right, if you were in calc,
00:32:39.650 --> 00:32:41.360
in multivariable
calc, you would have
00:32:41.360 --> 00:32:43.110
two parameter of theta
and x and you would
00:32:43.110 --> 00:32:45.320
need to give me a function.
00:32:45.320 --> 00:32:46.580
Now think of all--
00:32:46.580 --> 00:32:50.360
think of x and theta as being
one dimensional at this point.
00:32:50.360 --> 00:32:51.890
Think of all the
functions that can
00:32:51.890 --> 00:32:54.530
be depending on theta and x.
00:32:54.530 --> 00:32:56.660
There's many of them.
00:32:56.660 --> 00:33:01.810
And in particular, there's many
ways theta and x can interact.
00:33:01.810 --> 00:33:03.580
What the exponential
family does for you
00:33:03.580 --> 00:33:05.860
is that it restricts
the way these things
00:33:05.860 --> 00:33:07.877
can actually interact
with each other.
00:33:07.877 --> 00:33:09.460
It's essentially
saying the following.
00:33:09.460 --> 00:33:15.700
It's saying this is going to
be of the form exponential--
00:33:15.700 --> 00:33:18.100
so this exponential is
really not much because I
00:33:18.100 --> 00:33:20.020
could put a log next to it.
00:33:20.020 --> 00:33:24.940
But what I want is that
the way theta and x
00:33:24.940 --> 00:33:30.310
interact has to be of
the form theta times x
00:33:30.310 --> 00:33:32.530
in an exponential, OK?
00:33:32.530 --> 00:33:34.210
So that's the
simplest-- that's one
00:33:34.210 --> 00:33:36.585
of the ways you can think of
them interacting is you just
00:33:36.585 --> 00:33:37.900
take the product of the two.
00:33:37.900 --> 00:33:40.450
Now clearly, this is
not a very rich family.
00:33:40.450 --> 00:33:43.090
So what I'm allowing
myself is to just slap
00:33:43.090 --> 00:33:46.000
on some terms that depend only
on theta and depend only on x.
00:33:46.000 --> 00:33:52.630
So let's just call this thing, I
don't know, f of x, g of theta.
00:33:52.630 --> 00:33:56.649
OK, so here, I've restricted the
way theta and x can interact.
00:33:56.649 --> 00:33:58.190
So I have something
that depends only
00:33:58.190 --> 00:33:59.981
on x, something that
depends only on theta.
00:33:59.981 --> 00:34:02.560
And here, I have this
very specific interaction.
00:34:02.560 --> 00:34:06.310
And that's all that exponential
families are doing for you, OK?
00:34:06.310 --> 00:34:09.840
So if we go back to this slide,
this is much more general,
00:34:09.840 --> 00:34:14.770
right? If I want to go from
theta and x in r to theta
00:34:14.770 --> 00:34:16.270
and x theta in r--
00:34:19.449 --> 00:34:26.659
to theta in r k and x in rq,
I cannot take the product
00:34:26.659 --> 00:34:27.386
of theta and x.
00:34:27.386 --> 00:34:29.719
I cannot even take the inner
product between theta and x
00:34:29.719 --> 00:34:32.030
because they're not even
of compatible dimensions.
00:34:32.030 --> 00:34:37.460
But what I can do is to first
map my theta into something
00:34:37.460 --> 00:34:40.940
and map my x into something
so that I actually end up
00:34:40.940 --> 00:34:42.080
having the same dimensions.
00:34:42.080 --> 00:34:43.550
And then I can take
the inner product.
00:34:43.550 --> 00:34:44.900
That's the natural
generalization
00:34:44.900 --> 00:34:45.858
of this simple product.
00:34:59.800 --> 00:35:03.340
OK, so what I have is--
00:35:03.340 --> 00:35:05.230
right, so if I want
to go from theta
00:35:05.230 --> 00:35:10.510
to x, what I'm going to first
do is I'm going to take theta,
00:35:10.510 --> 00:35:11.710
eta of theta--
00:35:11.710 --> 00:35:16.590
so let's say eta1 of
theta to eta k of theta.
00:35:20.100 --> 00:35:22.220
And then I'm going
to actually take
00:35:22.220 --> 00:35:29.994
x becomes t1 of x all
the way to tk of x.
00:35:29.994 --> 00:35:32.160
And what I'm going to do
is take the inner product--
00:35:32.160 --> 00:35:35.540
so let's call this eta
and let's call this t.
00:35:35.540 --> 00:35:39.710
And I'm going to take the inner
product of eta and t, which
00:35:39.710 --> 00:35:49.550
is just the sum from j equal
1 to k of eta j of theta times
00:35:49.550 --> 00:35:52.770
tj of x.
00:35:52.770 --> 00:35:57.690
OK, so that's just a way to say
I want this simple interaction
00:35:57.690 --> 00:35:58.690
but in higher dimension.
00:35:58.690 --> 00:36:00.900
The simplest way I can actually
make those things happen
00:36:00.900 --> 00:36:02.233
is just by taking inner product.
00:36:05.490 --> 00:36:07.010
OK, and so now what
it's telling me
00:36:07.010 --> 00:36:09.630
is that the distribution-- so
I want the exponential times
00:36:09.630 --> 00:36:11.921
something that depends only
on theta and something that
00:36:11.921 --> 00:36:12.990
depends only on x.
00:36:12.990 --> 00:36:14.490
And so what it tells
me is that when
00:36:14.490 --> 00:36:16.200
I'm going to take
p of theta x, it's
00:36:16.200 --> 00:36:19.640
just going to be something
which is exponential
00:36:19.640 --> 00:36:30.225
times the sum from j equal 1
to k of eta j theta tj of x.
00:36:30.225 --> 00:36:32.600
And then I'm going to have a
function that depends only--
00:36:32.600 --> 00:36:36.040
so let me write it for now
like c of theta and then
00:36:36.040 --> 00:36:37.610
a function that
depends only on x.
00:36:37.610 --> 00:36:39.490
Let me call it h of x.
00:36:39.490 --> 00:36:42.340
And for convenience,
there's no particular reason
00:36:42.340 --> 00:36:43.300
why I do that.
00:36:43.300 --> 00:36:45.220
I'm taking this
function c of theta
00:36:45.220 --> 00:36:47.300
and I'm just actually
pushing it in there.
00:36:47.300 --> 00:36:57.182
So I can write c of theta as
exponential minus log of 1
00:36:57.182 --> 00:36:58.612
over c of theta, right?
00:37:01.330 --> 00:37:03.324
And now I have exponential
times exponential.
00:37:03.324 --> 00:37:04.990
So I push it in, and
this thing actually
00:37:04.990 --> 00:37:10.320
looks like exponential sum
from j equal 1 to k of eta
00:37:10.320 --> 00:37:22.120
j theta tj of x minus log 1
over c of theta times h of x.
00:37:22.120 --> 00:37:26.030
And this thing here, log 1 over
c of theta, I call actually
00:37:26.030 --> 00:37:32.060
b of theta. Because
c, I called it c.
00:37:32.060 --> 00:37:35.140
But I can actually
directly call this guy b,
00:37:35.140 --> 00:37:38.160
and I don't actually
care about c itself.
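Written out, the form built up on the board reads as follows. This is a transcription of the spoken derivation into symbols; the capital letters $T_j$ and $B$ are notational choices made here for the functions called t and b in the lecture.

```latex
p_\theta(x)
  = c(\theta)\, h(x) \exp\!\Big( \sum_{j=1}^{k} \eta_j(\theta)\, T_j(x) \Big)
  = \exp\!\Big( \sum_{j=1}^{k} \eta_j(\theta)\, T_j(x) - B(\theta) \Big)\, h(x),
\qquad
B(\theta) = \log \frac{1}{c(\theta)}
```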
00:37:38.160 --> 00:37:43.900
Now why don't I put back
also h of x in there?
00:37:43.900 --> 00:37:48.949
Because h of x is
really here to just--
00:37:48.949 --> 00:37:50.398
how to put it--
00:37:54.262 --> 00:38:00.160
OK, h of x and b of theta
don't play the same role.
00:38:00.160 --> 00:38:03.490
B of theta in many ways is a
normalizing constant, right?
00:38:03.490 --> 00:38:06.820
I want this density
to integrate to 1.
00:38:06.820 --> 00:38:09.520
If I did not have
this guy, I'm not
00:38:09.520 --> 00:38:11.950
guaranteed that this
thing integrates to 1.
00:38:11.950 --> 00:38:14.860
But by tweaking this function
b of theta or c of theta--
00:38:14.860 --> 00:38:16.080
they're equivalent--
00:38:16.080 --> 00:38:18.350
I can actually ensure that
this thing integrates to 1.
00:38:18.350 --> 00:38:22.834
So b of theta is just
a normalizing constant.
00:38:22.834 --> 00:38:25.000
H of x is something that's
going to be funny for us.
00:38:25.000 --> 00:38:26.583
It's going to be
something that allows
00:38:26.583 --> 00:38:29.740
us to be able to treat both
discrete and continuous
00:38:29.740 --> 00:38:38.140
variables within the framework
of exponential families.
00:38:38.140 --> 00:38:40.060
So for those that are
familiar with this,
00:38:40.060 --> 00:38:41.890
this is essentially
saying that that h of x
00:38:41.890 --> 00:38:44.120
is really just a
change of measure.
00:38:44.120 --> 00:38:48.370
When I actually look at
the density of p of theta--
00:38:48.370 --> 00:38:50.320
this is with respect
to some measure--
00:38:50.320 --> 00:38:52.810
the fact that I just multiplied
by a function of x just
00:38:52.810 --> 00:38:53.990
means that I'm not looking--
00:38:53.990 --> 00:38:56.420
that this guy here
without h of x
00:38:56.420 --> 00:38:59.390
is not the density with respect
to the original measure,
00:38:59.390 --> 00:39:01.660
but it's the density with
respect to the distribution
00:39:01.660 --> 00:39:04.272
that has h as a density.
00:39:04.272 --> 00:39:05.480
That's all I'm saying, right?
00:39:05.480 --> 00:39:08.650
So I can first transform my
x's and then take the density
00:39:08.650 --> 00:39:10.300
with respect to that.
00:39:10.300 --> 00:39:12.851
If you don't want to think
about densities or measures,
00:39:12.851 --> 00:39:13.600
you don't have to.
00:39:13.600 --> 00:39:14.790
This is just the way--
00:39:14.790 --> 00:39:16.930
this is just the definition.
00:39:16.930 --> 00:39:19.610
Is there any question
about this definition?
00:39:19.610 --> 00:39:21.290
All right, so it
looks complicated,
00:39:21.290 --> 00:39:23.560
but it's actually
essentially the simplest
00:39:23.560 --> 00:39:25.360
way you could think about it.
00:39:25.360 --> 00:39:29.004
You want to be able to
have x and theta interact
00:39:29.004 --> 00:39:30.670
and you just say, I
want the interaction
00:39:30.670 --> 00:39:34.126
to be of the form
exponential x times theta.
00:39:34.126 --> 00:39:35.500
And if they're
higher dimensions,
00:39:35.500 --> 00:39:36.530
I'm going to take
the exponential
00:39:36.530 --> 00:39:38.203
of the function
of x inner product
00:39:38.203 --> 00:39:39.244
with a function of theta.
00:39:43.749 --> 00:39:45.540
All right, so I claimed
since the beginning
00:39:45.540 --> 00:39:47.167
that the Gaussian
was such an example.
00:39:47.167 --> 00:39:48.000
So let's just do it.
00:39:48.000 --> 00:39:51.330
So is the Gaussian of the-- is
the interaction between theta
00:39:51.330 --> 00:39:55.350
and x in a Gaussian of
the form of an inner product?
00:39:55.350 --> 00:39:58.680
And the answer is yes.
00:39:58.680 --> 00:40:03.480
Actually, whether I know or
not what the variance is, OK?
00:40:03.480 --> 00:40:06.747
So let's start for the case
where I actually do not
00:40:06.747 --> 00:40:07.830
know what the variance is.
00:40:07.830 --> 00:40:13.500
So here, I have x is
N(mu, sigma squared).
00:40:13.500 --> 00:40:14.804
This is all one dimensional.
00:40:14.804 --> 00:40:17.220
And here, I'm going to assume
that my parameter is both mu
00:40:17.220 --> 00:40:19.440
and sigma square.
00:40:19.440 --> 00:40:22.135
OK, so what I need to do is
to have some function of mu,
00:40:22.135 --> 00:40:24.510
some function of sigma squared, and take an inner product
and take an inner product
00:40:24.510 --> 00:40:26.635
of some function of x and
some other function of x.
00:40:26.635 --> 00:40:29.060
So I want to show that--
00:40:29.060 --> 00:40:32.350
so p theta of x is what?
00:40:32.350 --> 00:40:36.390
Well, it's one over
sigma square root 2 pi
00:40:36.390 --> 00:40:42.280
exponential minus x minus mu
squared over 2 sigma squared,
00:40:42.280 --> 00:40:44.370
right?
00:40:44.370 --> 00:40:45.840
So that's just my
Gaussian density.
00:40:45.840 --> 00:40:49.410
And I want to say that
this thing here-- so
00:40:49.410 --> 00:40:51.660
clearly, the exponential
shows up already.
00:40:51.660 --> 00:40:53.970
I want to show that this
is something that looks
00:40:53.970 --> 00:41:01.620
like, you know, eta 1 of--
00:41:01.620 --> 00:41:08.395
sorry, so that was-- yeah, eta
1 of, say, mu sigma squared.
00:41:08.395 --> 00:41:09.770
So I have only
two of those guys,
00:41:09.770 --> 00:41:11.780
so I'm going to need
only two etas, right?
00:41:11.780 --> 00:41:16.030
So I want it to be eta 1
of mu, sigma squared times t1
00:41:16.030 --> 00:41:22.940
of x plus eta 2 of mu, sigma
squared times t2 of x, right?
00:41:22.940 --> 00:41:26.070
So I want to have something
like that that shows up,
00:41:26.070 --> 00:41:27.830
and the only things
that are left,
00:41:27.830 --> 00:41:32.250
I want them to depend either
only on theta or only on x.
00:41:32.250 --> 00:41:37.500
So to find that out,
we just need to expand.
00:41:37.500 --> 00:41:42.480
OK, so I'm going to first put
everything into my exponential
00:41:42.480 --> 00:41:43.650
and expand this guy.
00:41:43.650 --> 00:41:46.110
So the first term here
is going to be minus x
00:41:46.110 --> 00:41:47.876
squared over 2 sigma square.
00:41:47.876 --> 00:41:49.500
The second term is
going to be minus mu
00:41:49.500 --> 00:41:51.150
squared over two sigma squared.
00:41:51.150 --> 00:41:55.650
And then the cross term is
going to be plus x mu divided
00:41:55.650 --> 00:41:57.014
by sigma squared.
00:41:57.014 --> 00:41:58.680
And then I'm going
to put this guy here.
00:41:58.680 --> 00:42:05.037
So I have a minus log
sigma square root 2 pi, OK?
00:42:09.020 --> 00:42:13.740
OK, is this-- so this term
here contains an interaction
00:42:13.740 --> 00:42:15.510
between X and the parameters.
00:42:15.510 --> 00:42:17.250
This term here
contains an interaction
00:42:17.250 --> 00:42:18.480
between X and the parameters.
00:42:18.480 --> 00:42:21.240
So let me try to write
them in a way that I want.
00:42:21.240 --> 00:42:22.950
This guy only depends
on the parameters,
00:42:22.950 --> 00:42:25.870
this guy only depends
on the parameter.
00:42:25.870 --> 00:42:28.390
So I'm going to
rearrange things.
00:42:28.390 --> 00:42:34.080
And so I claim that this
is of the form x squared.
00:42:34.080 --> 00:42:36.369
Well, let's say-- do--
00:42:43.770 --> 00:42:44.800
who's getting the minus?
00:42:44.800 --> 00:42:46.180
Eta, OK.
00:42:46.180 --> 00:42:52.960
So it's x squared times
minus 1 over 2 sigma
00:42:52.960 --> 00:42:58.450
squared plus x times mu
over sigma squared, right?
00:42:58.450 --> 00:42:59.630
So that's this term here.
00:42:59.630 --> 00:43:01.060
That's this term here.
00:43:01.060 --> 00:43:04.129
Now I need to get this guy
here, and that's minus.
00:43:04.129 --> 00:43:05.920
So I'm going to write
it like this-- minus,
00:43:05.920 --> 00:43:09.950
and now I have mu
squared over 2 sigma
00:43:09.950 --> 00:43:15.648
squared plus log sigma
square root 2 pi.
00:43:22.210 --> 00:43:31.430
And now this thing is definitely
of the form t of x times--
00:43:31.430 --> 00:43:34.020
did I call them the
right way or not?
00:43:34.020 --> 00:43:36.490
Of course not.
00:43:36.490 --> 00:43:39.450
OK, so that's going to
be t2 of x times eta
00:43:39.450 --> 00:43:41.820
2 of theta.
00:43:41.820 --> 00:43:48.230
This guy is going to be t1
of x times eta 1 of theta.
00:43:48.230 --> 00:43:50.992
All right, so just a function
of theta times a function of x--
00:43:50.992 --> 00:43:52.950
just a function of theta
times a function of x.
00:43:52.950 --> 00:43:55.680
And the way they're combined is
just by summing them.
00:43:55.680 --> 00:43:58.690
And this is going
to be my b of theta.
00:44:01.710 --> 00:44:04.700
What is h of x?
00:44:04.700 --> 00:44:06.145
AUDIENCE: 1.
00:44:06.145 --> 00:44:07.020
PHILIPPE RIGOLLET: 1.
00:44:07.020 --> 00:44:09.800
There's one thing I
can actually play with,
00:44:09.800 --> 00:44:13.040
and this is something where you're
going to have some free
00:44:13.040 --> 00:44:14.090
choices, right?
00:44:14.090 --> 00:44:19.850
This is not actually completely
determined here is that--
00:44:19.850 --> 00:44:27.220
for example, so when I write
the log sigma square root 2 pi,
00:44:27.220 --> 00:44:32.660
this is just log of sigma
plus log square root 2 pi.
00:44:32.660 --> 00:44:34.270
So I have two choices here.
00:44:34.270 --> 00:44:37.670
Either my b becomes
this guy, or--
00:44:37.670 --> 00:44:41.150
so either I have
b of theta, which
00:44:41.150 --> 00:44:45.320
is mu squared over 2 sigma
squared plus log sigma
00:44:45.320 --> 00:44:51.920
square root 2 pi and h of
x is equal to 1, or I have
00:44:51.920 --> 00:44:56.120
that b of theta is mu
square over 2 sigma squared
00:44:56.120 --> 00:44:58.120
plus log sigma.
00:44:58.120 --> 00:44:59.750
And h of x is equal to what?
00:45:08.400 --> 00:45:10.300
Well, I can just push
this guy out, right?
00:45:10.300 --> 00:45:12.160
I can push it out
of the exponential.
00:45:12.160 --> 00:45:15.370
And so it's just 1 over square
root of 2 pi, which is
00:45:15.370 --> 00:45:16.590
a function of x, technically.
00:45:16.590 --> 00:45:19.760
I mean, it's a constant function
of x, but it's a function.
00:45:19.760 --> 00:45:22.420
So you can see that it's
not completely clear
00:45:22.420 --> 00:45:25.090
how you're going to do
the trade off, right?
00:45:25.090 --> 00:45:28.840
So the constant terms can
go either in b or in h.
00:45:28.840 --> 00:45:33.660
But you know, why bother with
tracking down b and h when
00:45:33.660 --> 00:45:35.410
you can actually stuff
everything into one
00:45:35.410 --> 00:45:38.200
and just call h one
and call it a day?
00:45:38.200 --> 00:45:40.770
Right, so you can
just forget about h.
00:45:40.770 --> 00:45:43.790
You know it's one and
think about the rest.
00:45:43.790 --> 00:45:46.410
H won't matter actually for
estimation purposes or anything
00:45:46.410 --> 00:45:48.386
like this.
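The two-parameter Gaussian decomposition just described can be checked numerically. This is a minimal sketch, not something shown in the lecture: the function names and the `constant_in_h` flag are made up for illustration, and it assumes theta = (mu, sigma squared) with t1(x) = x and t2(x) = x squared as on the board.

```python
import math

def normal_pdf(x, mu, sigma):
    """Textbook N(mu, sigma^2) density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_expfam(x, mu, sigma, constant_in_h=False):
    """Same density written as h(x) * exp(eta1*t1(x) + eta2*t2(x) - b(theta))."""
    t1, t2 = x, x * x                     # functions of x only
    eta1 = mu / sigma**2                  # functions of theta only
    eta2 = -1.0 / (2 * sigma**2)
    b = mu**2 / (2 * sigma**2) + math.log(sigma)
    if constant_in_h:
        h = 1.0 / math.sqrt(2 * math.pi)  # constant pushed out of the exponential
    else:
        h = 1.0
        b += math.log(math.sqrt(2 * math.pi))  # constant kept inside b
    return h * math.exp(eta1 * t1 + eta2 * t2 - b)

# Both ways of splitting the constant give back the same density.
for flag in (False, True):
    assert math.isclose(normal_expfam(0.7, 1.5, 2.0, flag), normal_pdf(0.7, 1.5, 2.0))
```

Either choice of where the constant goes reproduces the density, which is the point about b and h not being uniquely determined.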
00:45:48.386 --> 00:45:50.760
All right, so that's basically
everything that's written.
00:45:50.760 --> 00:45:55.040
When sigma square
is known, what's
00:45:55.040 --> 00:46:00.020
happening is that this
guy here is no longer
00:46:00.020 --> 00:46:03.640
a function of theta, right?
00:46:03.640 --> 00:46:05.401
Agreed?
00:46:05.401 --> 00:46:06.650
This is no longer a parameter.
00:46:06.650 --> 00:46:14.990
When sigma square is known,
then theta is equal to mu only.
00:46:14.990 --> 00:46:17.660
There's no sigma
square going on.
00:46:17.660 --> 00:46:19.657
So everything that
depends on sigma square
00:46:19.657 --> 00:46:20.990
can be thought of as a constant.
00:46:20.990 --> 00:46:23.010
Think one.
00:46:23.010 --> 00:46:26.910
So in particular, this
term here does not
00:46:26.910 --> 00:46:30.270
belong in the interaction
between x and theta.
00:46:30.270 --> 00:46:37.150
It belongs to h, right?
00:46:37.150 --> 00:46:49.120
So if sigma is known, then this
guy is only a function of h--
00:46:49.120 --> 00:46:50.770
of x.
00:46:50.770 --> 00:47:01.420
So h of x becomes exponential
of minus x squared
00:47:01.420 --> 00:47:05.674
over 2 sigma squared, right?
00:47:05.674 --> 00:47:06.840
That's just a function of x.
00:47:11.010 --> 00:47:11.681
Is that clear?
00:47:16.100 --> 00:47:18.650
So if you complete this
computation, what you're
00:47:18.650 --> 00:47:28.402
going to get is that your new
one parameter thing is that p
00:47:28.402 --> 00:47:35.760
theta x is now equal to
exponential x times mu
00:47:35.760 --> 00:47:39.227
over sigma squared minus--
00:47:39.227 --> 00:47:40.560
well, it's still the same thing.
00:47:49.300 --> 00:47:51.390
And then you have your
h of x that comes out--
00:47:54.210 --> 00:47:58.370
x squared over 2 sigma squared.
00:47:58.370 --> 00:48:02.350
OK, so that's my h of x.
00:48:02.350 --> 00:48:05.960
That's still my b of theta.
00:48:05.960 --> 00:48:11.260
And this is my t1 of x.
00:48:11.260 --> 00:48:15.190
And this is my eta one of theta.
00:48:15.190 --> 00:48:18.060
And remember, theta is just
equal to mu in this case.
00:48:22.680 --> 00:48:26.610
So if I ask you prove that
this distribution belongs
00:48:26.610 --> 00:48:29.390
to an exponential family,
you just have to work it out.
00:48:29.390 --> 00:48:32.480
Typically, it's expanding what's
in the exponential and see
00:48:32.480 --> 00:48:33.180
what's--
00:48:33.180 --> 00:48:35.690
and just write it in
this term and identify
00:48:35.690 --> 00:48:36.890
all the components, right?
00:48:36.890 --> 00:48:39.576
So here, notice those guys
don't even get an index anymore
00:48:39.576 --> 00:48:40.950
because there's
just one of them.
00:48:40.950 --> 00:48:45.629
So I wrote eta 1 and t1, but
it's really just eta and t.
00:48:50.270 --> 00:48:54.410
Oh sorry, this guy also goes.
00:48:54.410 --> 00:48:56.240
This is also a constant, right?
00:48:56.240 --> 00:49:01.240
So I can actually
just put it in, divided
00:49:01.240 --> 00:49:03.390
by sigma square root 2 pi.
00:49:03.390 --> 00:49:04.790
So h of x is what, actually?
00:49:08.718 --> 00:49:12.155
Is it the density of--
00:49:12.155 --> 00:49:13.630
AUDIENCE: Standard [INAUDIBLE].
00:49:13.630 --> 00:49:14.350
PHILIPPE RIGOLLET:
It's not standard.
00:49:14.350 --> 00:49:15.340
It's centered.
00:49:15.340 --> 00:49:16.330
It has mean 0.
00:49:16.330 --> 00:49:18.810
But it has variance
sigma squared, right?
00:49:18.810 --> 00:49:21.060
But it's the density
of a Gaussian.
00:49:21.060 --> 00:49:23.620
And this is what I
meant when I said
00:49:23.620 --> 00:49:27.280
h of x is really just telling
you with respect to which
00:49:27.280 --> 00:49:30.920
distribution, which measure
you're taking the density.
00:49:30.920 --> 00:49:33.310
And so this thing here
is really telling you
00:49:33.310 --> 00:49:37.690
the density of my
Gaussian with mean mu
00:49:37.690 --> 00:49:41.710
is equal to-- is this with
respect to a centered Gaussian
00:49:41.710 --> 00:49:43.636
is this guy, right?
00:49:43.636 --> 00:49:44.510
That's what it means.
00:49:44.510 --> 00:49:46.310
If this thing ends
up being a density,
00:49:46.310 --> 00:49:49.370
it just means that now you
just have a new measure, which
00:49:49.370 --> 00:49:51.270
is this density.
00:49:51.270 --> 00:49:53.270
So it's just saying
that the density
00:49:53.270 --> 00:49:57.560
of the Gaussian with
mean mu with respect
00:49:57.560 --> 00:50:00.928
to the Gaussian with mean 0
is just this [INAUDIBLE] here.
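A small numerical check of the known-variance case may help here. This is a sketch under the assumptions stated in the lecture: theta = mu, t(x) = x, eta(theta) = mu over sigma squared, b(theta) = mu squared over 2 sigma squared, and h(x) is the density of a centered Gaussian with variance sigma squared. The function names are illustrative only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Textbook N(mu, sigma^2) density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def one_param_form(x, mu, sigma):
    """N(mu, sigma^2) density as h(x) * exp(x * eta(mu) - b(mu)), sigma known."""
    h = normal_pdf(x, 0.0, sigma)    # centered Gaussian density, depends on x only
    eta = mu / sigma**2              # depends on theta = mu only
    b = mu**2 / (2 * sigma**2)       # depends on theta = mu only
    return h * math.exp(x * eta - b)

# The exponential factor is the density of N(mu, sigma^2) relative to N(0, sigma^2).
assert math.isclose(one_param_form(0.3, 1.0, 2.0), normal_pdf(0.3, 1.0, 2.0))
```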
00:50:05.140 --> 00:50:07.960
All right, so let's move on.
00:50:07.960 --> 00:50:11.050
So here, as I said,
you could actually
00:50:11.050 --> 00:50:13.840
do all these computations
and forget about the fact
00:50:13.840 --> 00:50:16.430
that x is continuous.
00:50:16.430 --> 00:50:20.690
You can actually do it with PMFs
and do it for x is discrete.
00:50:20.690 --> 00:50:23.540
This actually also tells
you if you can actually
00:50:23.540 --> 00:50:26.540
get the same form for
your density, which
00:50:26.540 --> 00:50:29.000
is of the form exponential
times the product
00:50:29.000 --> 00:50:32.060
of the interaction
between theta
00:50:32.060 --> 00:50:34.010
and x is just
taking this product,
00:50:34.010 --> 00:50:36.950
then a function only of theta
and of function only of x,
00:50:36.950 --> 00:50:40.130
for the PMF, it also works.
00:50:40.130 --> 00:50:42.050
OK, so I claim
that the Bernoulli
00:50:42.050 --> 00:50:44.960
belongs to this family.
00:50:44.960 --> 00:50:49.380
So the PMF of a Bernoulli--
00:50:49.380 --> 00:50:54.590
we say parameter p is p to the
x 1 minus p to the 1 minus x,
00:50:54.590 --> 00:50:55.540
right?
00:50:55.540 --> 00:51:00.440
Because we know so that's
only for x equals 0 or 1.
00:51:00.440 --> 00:51:03.160
And the reason is because
when x is equal to 0,
00:51:03.160 --> 00:51:04.060
this is 1 minus p.
00:51:04.060 --> 00:51:06.627
When x is equal to
1, this is p.
00:51:06.627 --> 00:51:08.210
OK, we've seen that
when we're looking
00:51:08.210 --> 00:51:11.730
at likelihoods for Bernoullis.
00:51:11.730 --> 00:51:16.700
OK, it is not clear this is
going to look like this at all.
00:51:16.700 --> 00:51:19.610
But let's do it.
00:51:19.610 --> 00:51:21.660
OK, so what does
this thing look like?
00:51:21.660 --> 00:51:23.150
Well, the first
thing I want to do
00:51:23.150 --> 00:51:24.710
is to make an
exponential show up.
00:51:24.710 --> 00:51:26.084
So what I'm going
to write is I'm
00:51:26.084 --> 00:51:31.190
going to write p to the x as
exponential x log p, right?
00:51:33.714 --> 00:51:35.630
And so I'm going to do
that for the other one.
00:51:35.630 --> 00:51:37.690
So this thing here--
00:51:37.690 --> 00:51:43.090
so I'm going to get
exponential x log
00:51:43.090 --> 00:51:47.038
p plus 1 minus x log 1 minus p.
00:51:51.250 --> 00:51:54.050
So what I need to do is
to collect my terms in x
00:51:54.050 --> 00:51:56.750
and my terms in whatever
parameters I have,
00:51:56.750 --> 00:51:59.080
so here theta is equal to p.
00:52:03.180 --> 00:52:05.170
So if I do this,
what I end up having
00:52:05.170 --> 00:52:08.440
is equal to exponential--
00:52:08.440 --> 00:52:12.650
so the term in x is log
p minus log 1 minus p.
00:52:12.650 --> 00:52:18.140
So that's x times
log p over 1 minus p.
00:52:18.140 --> 00:52:20.050
And then the term
that rest is just--
00:52:20.050 --> 00:52:23.276
that stays is just 1
times log 1 minus p.
00:52:23.276 --> 00:52:25.400
But I want to see this as
a minus something, right?
00:52:25.400 --> 00:52:27.067
It was minus b of theta.
00:52:27.067 --> 00:52:28.525
So I'm going to
write it as minus--
00:52:32.890 --> 00:52:35.150
well, I can just keep the
plus, and I'm going to do--
00:52:41.770 --> 00:52:44.362
and that's all [INAUDIBLE].
00:52:44.362 --> 00:52:46.340
A-ha!
00:52:46.340 --> 00:52:48.060
Well, this is of the
form exponential--
00:52:48.060 --> 00:52:50.940
something that depends only on
x times something that depends
00:52:50.940 --> 00:52:52.720
only on theta--
00:52:52.720 --> 00:52:56.000
minus a function that
depends only on theta.
00:52:56.000 --> 00:52:59.230
And then h of x is
equal to 1 again.
00:52:59.230 --> 00:53:00.630
OK, so let's see.
00:53:00.630 --> 00:53:03.410
So I have t1 of x is equal to x.
00:53:03.410 --> 00:53:04.880
That's this guy.
00:53:04.880 --> 00:53:11.000
Eta 1 of theta is equal
to log of p over 1 minus p.
00:53:11.000 --> 00:53:20.930
And b of theta is equal to
log 1 over 1 minus p, OK?
00:53:20.930 --> 00:53:26.470
And h of x is equal
to 1, all right?
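The Bernoulli identification above is easy to verify for both values of x. This is a minimal sketch with illustrative function names; it only uses the components read off the board: t1(x) = x, eta 1 = log of p over 1 minus p, b = log of 1 over 1 minus p, and h = 1.

```python
import math

def bernoulli_pmf(x, p):
    """PMF p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, p):
    """Same PMF written as exp(t1(x) * eta1(p) - b(p)) with h(x) = 1."""
    eta1 = math.log(p / (1 - p))   # log-odds, depends on the parameter only
    b = math.log(1 / (1 - p))      # normalizing term, depends on the parameter only
    return math.exp(x * eta1 - b)

# The two expressions agree at both points of the support.
for x in (0, 1):
    assert math.isclose(bernoulli_expfam(x, 0.3), bernoulli_pmf(x, 0.3))
```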
00:53:29.310 --> 00:53:31.230
You guys want to do
Poisson, or do you
00:53:31.230 --> 00:53:32.313
want to have any homework?
00:53:35.120 --> 00:53:37.670
It's a dilemma because that's
an easy homework versus
00:53:37.670 --> 00:53:41.410
no homework at all but maybe
something more difficult. OK,
00:53:41.410 --> 00:53:43.680
who wants to do it now?
00:53:43.680 --> 00:53:46.331
Who does not want to
raise their hand now?
00:53:46.331 --> 00:53:47.747
Who wants to raise
their hand now?
00:53:47.747 --> 00:53:57.116
All right, so let's move on.
00:53:57.116 --> 00:53:59.240
I'll just do-- do you want
to do the gammas instead
00:53:59.240 --> 00:54:00.080
in the homework?
00:54:00.080 --> 00:54:02.150
That's going to be fun.
00:54:02.150 --> 00:54:04.450
I'm not even going to
propose to do the gammas.
00:54:04.450 --> 00:54:08.570
And so this is the
gamma distribution.
00:54:08.570 --> 00:54:10.820
It's brilliantly
called gamma because it
00:54:10.820 --> 00:54:14.480
has the gamma function just
like the beta distribution had
00:54:14.480 --> 00:54:16.400
the beta function in there.
00:54:16.400 --> 00:54:17.690
They look very similar.
00:54:17.690 --> 00:54:20.960
One is defined over r plus,
the positive real line.
00:54:20.960 --> 00:54:24.650
And remember, the beta was
defined over the interval 0, 1.
00:54:24.650 --> 00:54:28.640
And it's of the form x to
some power times exponential
00:54:28.640 --> 00:54:30.980
of minus x to some--
00:54:30.980 --> 00:54:32.340
times something, right?
00:54:32.340 --> 00:54:34.298
So there's a function of
polynomial [INAUDIBLE]
00:54:34.298 --> 00:54:38.004
x where the exponent
depends on the parameter.
00:54:38.004 --> 00:54:40.670
And then there's the exponential
minus x times something depends
00:54:40.670 --> 00:54:41.930
on the parameters.
00:54:41.930 --> 00:54:47.921
So this is going to also look
like some function of x--
00:54:47.921 --> 00:54:49.670
sorry, like some
exponential distribution.
00:54:49.670 --> 00:54:52.280
Can somebody guess what
is going to be t2 of x?
00:54:58.338 --> 00:55:01.251
Oh, those are the functions of
x that show up in this product,
00:55:01.251 --> 00:55:01.750
right?
00:55:01.750 --> 00:55:03.462
Remember when we have this--
00:55:03.462 --> 00:55:05.170
we just need to take
some transformations
00:55:05.170 --> 00:55:08.870
of x so it looks linear in those
things and not in x itself.
00:55:08.870 --> 00:55:11.570
Remember, we had x squared
and x, for example,
00:55:11.570 --> 00:55:12.560
in the Gaussian case.
00:55:12.560 --> 00:55:14.471
I don't know if
it's still there.
00:55:14.471 --> 00:55:15.720
Yeah, it's still there, right?
00:55:15.720 --> 00:55:17.330
t2 was x squared.
00:55:17.330 --> 00:55:20.540
What do you think x is
going-- t2 of x here.
00:55:20.540 --> 00:55:23.536
So here's a hint.
t1 is going to be x.
00:55:23.536 --> 00:55:24.410
AUDIENCE: [INAUDIBLE]
00:55:24.410 --> 00:55:25.480
PHILIPPE RIGOLLET:
Yeah, [INAUDIBLE],
00:55:25.480 --> 00:55:26.438
what is going to be t1?
00:55:26.438 --> 00:55:27.890
Yeah, you can--
this one is taken.
00:55:27.890 --> 00:55:28.690
This one is taken.
00:55:31.313 --> 00:55:32.580
What?
00:55:32.580 --> 00:55:33.700
Log x, right?
00:55:33.700 --> 00:55:35.680
Because this x to
the a minus 1, I'm
00:55:35.680 --> 00:55:39.380
going to write that as
exponential a minus 1 log x.
00:55:39.380 --> 00:55:43.375
So basically, eta 1 is
going to be a minus 1.
00:55:43.375 --> 00:55:47.560
Eta 2 is going to
be minus 1 over b--
00:55:47.560 --> 00:55:48.980
well, actually the opposite.
00:55:48.980 --> 00:55:50.271
And then you're going to have--
00:55:50.271 --> 00:55:52.630
but this is actually
not too complicated.
00:55:52.630 --> 00:55:55.090
All right, then those
parameters get names.
00:55:55.090 --> 00:55:58.480
a is the shape parameter,
b is the scale parameter.
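The gamma case can be worked the same way. This sketch assumes the slide uses the shape-scale parametrization, with density x to the a minus 1 times exponential of minus x over b, divided by Gamma(a) times b to the a; that formula is not in the transcript, so treat it as an assumption. The name `norm` for the parameter-only term is illustrative.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and scale b (assumed parametrization)."""
    return x ** (a - 1) * math.exp(-x / b) / (math.gamma(a) * b**a)

def gamma_expfam(x, a, b):
    """Same density as exp(eta1*t1(x) + eta2*t2(x) - norm(theta)) with h(x) = 1."""
    t1, t2 = x, math.log(x)                   # functions of x only
    eta1, eta2 = -1.0 / b, a - 1.0            # functions of theta = (a, b) only
    norm = math.lgamma(a) + a * math.log(b)   # log Gamma(a) + a log b
    return math.exp(eta1 * t1 + eta2 * t2 - norm)

assert math.isclose(gamma_expfam(1.7, 2.5, 0.8), gamma_pdf(1.7, 2.5, 0.8))
```

Here t1 = x pairs with minus 1 over b and t2 = log x pairs with a minus 1, matching the corrected assignment in the lecture.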
00:55:58.480 --> 00:56:00.280
It doesn't really matter.
00:56:00.280 --> 00:56:02.710
You have other things that
are called the inverse gamma
00:56:02.710 --> 00:56:05.850
distribution, which
has this form.
00:56:05.850 --> 00:56:09.360
The difference is that
the parameter alpha
00:56:09.360 --> 00:56:14.700
shows negatively there and
then the inverse Gaussian
00:56:14.700 --> 00:56:15.480
distribution.
00:56:18.150 --> 00:56:20.220
You know, just densities
you can come up with
00:56:20.220 --> 00:56:23.305
and they just happened
to fall in this family.
00:56:23.305 --> 00:56:25.680
And there's other ones that
you can actually put in there
00:56:25.680 --> 00:56:26.640
that we've seen before.
00:56:26.640 --> 00:56:28.695
The chi-square is actually
part of this family.
00:56:28.695 --> 00:56:30.570
The beta distribution
is part of this family.
00:56:30.570 --> 00:56:32.611
The binomial distribution
is part of this family.
00:56:32.611 --> 00:56:35.030
Well, that's easy because
the Bernoulli was.
00:56:35.030 --> 00:56:39.390
The negative binomial, which
is some stopping time--
00:56:39.390 --> 00:56:42.600
the first time you hit a
certain number of successes
00:56:42.600 --> 00:56:46.120
when you flip some
Bernoulli coins.
00:56:46.120 --> 00:56:47.665
So you can check
for all of those,
00:56:47.665 --> 00:56:50.040
and you will see that you can
actually write them as part
00:56:50.040 --> 00:56:51.510
of the exponential family.
00:56:51.510 --> 00:56:53.040
So the main goal
of this slide is
00:56:53.040 --> 00:56:54.581
to convince you that
this is actually
00:56:54.581 --> 00:56:56.400
a pretty broad range
of distributions
00:56:56.400 --> 00:57:00.360
because it basically includes
everything we've seen
00:57:00.360 --> 00:57:03.540
but not anything there--
00:57:03.540 --> 00:57:06.540
sorry, plus more, OK?
00:57:06.540 --> 00:57:07.040
Yeah.
00:57:07.040 --> 00:57:09.040
AUDIENCE: Is there any
example of a distribution
00:57:09.040 --> 00:57:10.456
that comes up
pretty often that's
00:57:10.456 --> 00:57:11.801
not in the exponential family?
00:57:11.801 --> 00:57:13.384
PHILIPPE RIGOLLET:
Yeah, like uniform.
00:57:13.384 --> 00:57:16.312
AUDIENCE: Oh, OK, so maybe
a bit more complicated than
00:57:16.312 --> 00:57:17.702
[INAUDIBLE].
00:57:17.702 --> 00:57:19.410
PHILIPPE RIGOLLET: Anything that
has a support that
00:57:19.410 --> 00:57:21.740
depends on the parameter
is not going to fall--
00:57:21.740 --> 00:57:24.410
is not going to fit in there.
00:57:24.410 --> 00:57:26.570
Right, and you can
actually convince yourself
00:57:26.570 --> 00:57:31.910
why anything that
has the support that
00:57:31.910 --> 00:57:33.680
does not-- that depends
on the parameter
00:57:33.680 --> 00:57:35.310
is not going to be
part of this guy.
00:57:35.310 --> 00:57:37.460
It's kind of a hard thing to--
00:57:37.460 --> 00:57:42.330
in fact, to prove that it's
not, to prove this rule.
00:57:42.330 --> 00:57:43.850
That's kind of a
little difficult,
00:57:43.850 --> 00:57:46.190
but the way you can convince
yourself is that remember,
00:57:46.190 --> 00:57:49.910
the only interaction between
x and theta that I allowed
00:57:49.910 --> 00:57:51.470
was taking the
product of those guys
00:57:51.470 --> 00:57:54.160
and then the exponential, right?
00:57:54.160 --> 00:57:56.660
If you have something that
depends on some parameter--
00:57:56.660 --> 00:57:59.740
let's say you're going to see
something that looks like this.
00:57:59.740 --> 00:58:01.510
Right, for uniform,
it looks like this.
00:58:04.720 --> 00:58:08.210
Well, this is not of the form
exponential x times theta.
00:58:08.210 --> 00:58:10.990
There's an interaction
between x and theta here,
00:58:10.990 --> 00:58:12.840
but it's actually
certainly not of the form
00:58:12.840 --> 00:58:14.580
exponential x times theta.
00:58:14.580 --> 00:58:16.680
So this is definitely
not going to be
00:58:16.680 --> 00:58:18.210
part of the exponential family.
00:58:18.210 --> 00:58:20.680
And every time you start
doing things like that,
00:58:20.680 --> 00:58:21.930
it's just not going to happen.
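What was written on the board is not in the transcript. Assuming it was the uniform distribution on the interval from 0 to theta, the argument can be sketched as follows.

```latex
% Assumed example: uniform on [0, \theta]. The density is
f_\theta(x) = \frac{1}{\theta}\,\mathbf{1}\{0 \le x \le \theta\}.
% In an exponential family,
%   f_\theta(x) = h(x)\,\exp\Big(\sum_k \eta_k(\theta)\,t_k(x) - b(\theta)\Big),
% the exponential factor is strictly positive, so
\{x : f_\theta(x) > 0\} = \{x : h(x) > 0\},
% which cannot depend on \theta. For the uniform, the support [0, \theta]
% moves with \theta, so no such factorization exists.
```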
00:58:25.790 --> 00:58:28.370
Actually, to be fair,
I'm not even sure
00:58:28.370 --> 00:58:30.680
that all these
guys, when you allow
00:58:30.680 --> 00:58:32.600
them to have all
their parameters free,
00:58:32.600 --> 00:58:34.810
are actually going
to be part of this.
00:58:34.810 --> 00:58:36.500
For example-- the
beta probably is,
00:58:36.500 --> 00:58:38.730
but I'm not actually
entirely convinced.
00:58:43.140 --> 00:58:47.320
There's books on
exponential families.
00:58:47.320 --> 00:58:48.970
All right, so let's go back.
00:58:48.970 --> 00:58:52.060
So here, we've put a lot
of effort understanding
00:58:52.060 --> 00:58:57.160
how big, how much wider than
the Gaussian distribution
00:58:57.160 --> 00:59:01.630
can we think of for the
conditional distribution
00:59:01.630 --> 00:59:04.030
of our response y given x.
00:59:04.030 --> 00:59:06.620
So let's go back to the
generalized linear models,
00:59:06.620 --> 00:59:07.120
right?
00:59:07.120 --> 00:59:09.870
So [INAUDIBLE] said, OK,
the random component?
00:59:09.870 --> 00:59:11.800
y has to be part of
some exponential family
00:59:11.800 --> 00:59:13.090
distribution-- check.
00:59:13.090 --> 00:59:14.726
We know what this means.
00:59:14.726 --> 00:59:16.350
So now I have to
understand two things.
00:59:16.350 --> 00:59:20.127
I have to understand what
is the expectation, right?
00:59:20.127 --> 00:59:21.960
Because that's actually
what I model, right?
00:59:21.960 --> 00:59:24.160
I take the expectation, the
conditional expectation,
00:59:24.160 --> 00:59:24.850
of y given x.
00:59:24.850 --> 00:59:27.100
So I need to understand
given this guy,
00:59:27.100 --> 00:59:30.250
it would be nice if you had some
simple rules that would tell me
00:59:30.250 --> 00:59:32.950
exactly what the expectation
is rather than having to do it
00:59:32.950 --> 00:59:34.360
over and over again, right?
00:59:34.360 --> 00:59:36.100
If I told you,
here's a Gaussian,
00:59:36.100 --> 00:59:37.600
compute the
expectation, every time
00:59:37.600 --> 00:59:40.750
you had to use that would
be slightly painful.
00:59:40.750 --> 00:59:43.510
So hopefully, this thing
being simple enough--
00:59:43.510 --> 00:59:45.870
we've actually
selected a class that's
00:59:45.870 --> 00:59:47.590
simple enough so that
we can have rules.
00:59:47.590 --> 00:59:52.360
Where as soon as I give you
those parameters t1, t2, eta 1,
00:59:52.360 --> 00:59:55.810
eta 2, b and h, you can
actually have some simple rules
00:59:55.810 --> 01:00:00.370
to compute the mean and
variance and all those things.
01:00:00.370 --> 01:00:03.520
And so in particular, I'm
interested in the mean,
01:00:03.520 --> 01:00:05.950
and I'm going to have to
actually say, well, you know,
01:00:05.950 --> 01:00:09.770
this mean has to be mapped
into the whole real line.
01:00:09.770 --> 01:00:12.040
So I can actually talk
about modeling this function
01:00:12.040 --> 01:00:14.410
of the mean as x transpose beta.
01:00:14.410 --> 01:00:17.380
And we saw that for
the [INAUDIBLE] dataset
01:00:17.380 --> 01:00:21.400
or whatever other data sets.
01:00:21.400 --> 01:00:24.250
You actually can-- you can
actually do this using the log
01:00:24.250 --> 01:00:27.960
of the reciprocal or for the--
01:00:27.960 --> 01:00:30.050
oh, actually, we didn't
do it for the Bernoulli.
01:00:30.050 --> 01:00:30.940
We'll come to this.
01:00:30.940 --> 01:00:32.981
This is the most important
one, and that's called
01:00:32.981 --> 01:00:34.510
a logit it or a logistic link.
01:00:37.090 --> 01:00:39.230
But before we go there,
this was actually
01:00:39.230 --> 01:00:42.320
a very broad family, right?
01:00:42.320 --> 01:00:44.995
When I wrote this thing on the
bottom board-- it's gone now,
01:00:44.995 --> 01:00:46.620
but when I wrote it
in the first place,
01:00:46.620 --> 01:00:48.870
the only thing that I wrote
is I wanted x times theta.
01:00:48.870 --> 01:00:51.119
Wouldn't it be nice if you
have some distribution that
01:00:51.119 --> 01:00:53.230
was just x times theta,
not some function of x
01:00:53.230 --> 01:00:54.660
times some function of theta?
01:00:54.660 --> 01:00:58.050
The functions seem to be
here so that they actually
01:00:58.050 --> 01:01:02.610
make things a little--
01:01:02.610 --> 01:01:05.160
so the functions were here
so that I can actually
01:01:05.160 --> 01:01:06.480
put a lot of functions there.
01:01:06.480 --> 01:01:08.430
But first of all,
if I actually decide
01:01:08.430 --> 01:01:10.680
to re-parametrize my
problem, I can always
01:01:10.680 --> 01:01:12.180
assume-- if I'm
one dimensional, I
01:01:12.180 --> 01:01:14.970
can always assume
that eta 1 of theta
01:01:14.970 --> 01:01:17.460
becomes my new theta, right?
01:01:17.460 --> 01:01:20.772
So this thing--
here for example,
01:01:20.772 --> 01:01:22.230
I could say, well,
this is actually
01:01:22.230 --> 01:01:23.510
the parameter of my Bernoulli.
01:01:23.510 --> 01:01:25.950
Let me call this
guy theta, right?
01:01:25.950 --> 01:01:28.230
I could do that.
01:01:28.230 --> 01:01:31.230
Then I could say, well, here
I have x that shows up here.
01:01:31.230 --> 01:01:33.980
And here since I'm talking
about the response,
01:01:33.980 --> 01:01:35.690
I cannot really make
any transformations.
01:01:35.690 --> 01:01:38.240
So here, I'm going to actually
talk about a specific family
01:01:38.240 --> 01:01:41.920
for which this guy is not x
square or square root of x
01:01:41.920 --> 01:01:43.350
or log of x or anything I want.
01:01:43.350 --> 01:01:45.349
I'm just going to actually
look at distributions
01:01:45.349 --> 01:01:46.430
for which this is x.
01:01:46.430 --> 01:01:48.485
These exponential
families are called
01:01:48.485 --> 01:01:51.140
canonical exponential families.
01:01:51.140 --> 01:01:55.010
So in the canonical
exponential family, what I have
01:01:55.010 --> 01:01:57.230
is that I have my x times theta.
01:01:57.230 --> 01:01:59.959
I'm going to allow myself
some normalization factor phi,
01:01:59.959 --> 01:02:01.500
and we'll see, for
example, that it's
01:02:01.500 --> 01:02:05.330
very convenient when I talk
about the Gaussian, right?
01:02:05.330 --> 01:02:07.830
Because even if I know--
01:02:11.250 --> 01:02:15.134
yeah, even if I know this guy,
which I actually pull into my--
01:02:15.134 --> 01:02:16.300
oh, that's over here, right?
01:02:20.970 --> 01:02:23.155
Right, I know sigma squared.
01:02:23.155 --> 01:02:24.780
But I don't want to
change my parameter
01:02:24.780 --> 01:02:26.290
to be mu over sigma squared.
01:02:26.290 --> 01:02:27.490
It's kind of painful.
01:02:27.490 --> 01:02:30.120
So I just take mu, and
I'm going to keep this guy
01:02:30.120 --> 01:02:31.980
as being this phi over there.
01:02:31.980 --> 01:02:34.830
And it's called the
dispersion parameter
01:02:34.830 --> 01:02:38.010
from a clear analogy
with the Gaussian, right?
01:02:38.010 --> 01:02:41.580
That's the variance and
that's measuring dispersion.
01:02:41.580 --> 01:02:45.540
OK, so here, what
I want is I'm going
01:02:45.540 --> 01:02:49.450
to think throughout this class--
so phi may be known or not.
01:02:49.450 --> 01:02:51.390
And depending--
when it's not known,
01:02:51.390 --> 01:02:54.360
this actually might turn
into some exponential family
01:02:54.360 --> 01:02:55.470
or it might not.
01:02:55.470 --> 01:03:01.380
And the main reason is because
this b of theta over phi
01:03:01.380 --> 01:03:04.950
is not necessarily a function
of theta over phi, right?
01:03:04.950 --> 01:03:09.660
If I actually have phi
unknown, then y theta over phi
01:03:09.660 --> 01:03:10.740
has to be--
01:03:10.740 --> 01:03:13.390
this guy has to be
my new parameter.
01:03:13.390 --> 01:03:17.930
And b might not be a function
of this new parameter.
01:03:17.930 --> 01:03:21.860
OK, so in a way,
it may or may not,
01:03:21.860 --> 01:03:24.710
but this is not really a
concern that we're going to have
01:03:24.710 --> 01:03:26.810
because throughout
this class, we're
01:03:26.810 --> 01:03:29.195
going to assume that
phi is known, OK?
01:03:29.195 --> 01:03:31.820
Phi is going to be known all the
time, which means that this is
01:03:31.820 --> 01:03:34.334
always an exponential family.
01:03:34.334 --> 01:03:35.750
And it's just the
simplest one you
01:03:35.750 --> 01:03:38.270
could think of-- one
dimensional parameter, one
01:03:38.270 --> 01:03:42.800
dimensional response, and I just
have-- the product is just y
01:03:42.800 --> 01:03:45.050
times-- well, we used to call it x.
01:03:45.050 --> 01:03:49.670
Now I've switched to y, but y
times theta divided by phi, OK?
01:03:52.550 --> 01:03:56.120
Should I write this or this is
clear to everyone what this is?
01:03:56.120 --> 01:03:58.665
Let me write it somewhere so
we actually keep track of it
01:03:58.665 --> 01:04:00.565
toward the [INAUDIBLE].
01:04:05.800 --> 01:04:07.990
OK, so this is--
01:04:07.990 --> 01:04:11.620
remember, we had all
the distributions.
01:04:11.620 --> 01:04:15.670
And then here we had
the exponential family.
01:04:15.670 --> 01:04:18.610
And now we have the
canonical exponential family.
01:04:21.280 --> 01:04:24.200
It's actually
much, much smaller.
01:04:24.200 --> 01:04:26.950
Well, actually, it's probably
sort of a good picture.
01:04:26.950 --> 01:04:32.620
And what I have is that
my density or my PMF
01:04:32.620 --> 01:04:37.120
is just exponential
y times theta minus b
01:04:37.120 --> 01:04:41.020
of theta divided by phi.
01:04:41.020 --> 01:04:46.480
And I have plus c of--
01:04:46.480 --> 01:04:53.820
oh, yeah, plus c
of y, phi, which
01:04:53.820 --> 01:04:58.680
means that this is really--
if phi is known, h of y
01:04:58.680 --> 01:05:05.742
is just exponential
c of y phi, agreed?
01:05:05.742 --> 01:05:07.950
Actually, this is the reason
why it's not necessarily
01:05:07.950 --> 01:05:10.410
a canonical family.
01:05:10.410 --> 01:05:12.990
It might not be that
this depends only on y.
01:05:12.990 --> 01:05:15.400
It could depend on y and
phi in some annoying way
01:05:15.400 --> 01:05:18.950
and I may not be
able to break it.
01:05:18.950 --> 01:05:21.220
OK, but if phi is known,
this is just a function
01:05:21.220 --> 01:05:23.580
that depends on y, agreed?
01:05:28.290 --> 01:05:29.670
In particular, I
think you need--
01:05:29.670 --> 01:05:31.753
I hope you can convince
yourself that this is just
01:05:31.753 --> 01:05:33.750
a subcase of everything
we've seen before.
01:05:41.990 --> 01:05:44.800
So for example, the Gaussian
when the variance is known
01:05:44.800 --> 01:05:47.010
is indeed of this form, right?
01:05:47.010 --> 01:05:49.220
So we still have
it on the board.
01:05:49.220 --> 01:05:51.040
So here is my y, right?
01:05:51.040 --> 01:05:53.950
So then let me write
this as f theta of y.
01:05:53.950 --> 01:05:59.030
So every x is replaceable
with y, blah, blah, blah.
01:05:59.030 --> 01:06:01.330
This is this guy.
01:06:01.330 --> 01:06:07.120
And now what I have is that
this is going to be my phi.
01:06:07.120 --> 01:06:10.630
This is my parameter of theta.
01:06:10.630 --> 01:06:14.320
So I'm definitely of the form
y times theta divided by phi.
01:06:14.320 --> 01:06:16.440
And then here I
have a function b
01:06:16.440 --> 01:06:20.890
that depends only on
theta, divided by phi again.
01:06:20.890 --> 01:06:27.040
So b of theta is mu
squared divided by 2.
01:06:31.000 --> 01:06:33.890
OK, then it's divided
by this sigma square.
01:06:33.890 --> 01:06:35.519
And then I have
this extra stuff.
01:06:35.519 --> 01:06:37.310
But I really don't care
what it is for now.
01:06:37.310 --> 01:06:42.140
It's just something that depends
only on y and known stuff.
01:06:42.140 --> 01:06:44.150
So it was just a function
of y just like my h.
01:06:44.150 --> 01:06:47.180
I stuff everything in there.
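The identification for the Gaussian with known variance can be checked numerically. This sketch takes theta = mu, phi = sigma squared and b(theta) = theta squared over 2 from the lecture; the explicit form of the leftover term c(y, phi) is worked out here rather than stated in the lecture, and the function names are illustrative.

```python
import math

def normal_pdf(y, mu, sigma):
    """Textbook N(mu, sigma^2) density."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gaussian_canonical(y, theta, phi):
    """exp((y*theta - b(theta)) / phi + c(y, phi)) with theta = mu, phi = sigma^2."""
    b = theta**2 / 2                                           # depends on theta only
    c = -y**2 / (2 * phi) - 0.5 * math.log(2 * math.pi * phi)  # depends on y and phi only
    return math.exp((y * theta - b) / phi + c)

assert math.isclose(gaussian_canonical(0.4, 1.2, 2.0**2), normal_pdf(0.4, 1.2, 2.0))
```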
01:06:47.180 --> 01:06:50.090
The b, though, this
thing here, this
01:06:50.090 --> 01:06:52.229
is actually what's
important because
01:06:52.229 --> 01:06:53.770
in the canonical
family, if you think
01:06:53.770 --> 01:06:57.060
about it, when you know phi--
01:06:57.060 --> 01:07:03.270
sorry-- right, this
is just y times theta
01:07:03.270 --> 01:07:05.790
scaled by a known
constant-- sorry, y times
01:07:05.790 --> 01:07:08.160
theta scaled by a known
constant is the first term.
01:07:08.160 --> 01:07:12.000
The second term is b of theta
scaled by some known constant.
01:07:12.000 --> 01:07:13.860
But b of theta is
what's going to make
01:07:13.860 --> 01:07:17.580
the difference between the
Gaussian and Bernoullis
01:07:17.580 --> 01:07:19.680
and gammas and betas--
01:07:19.680 --> 01:07:21.750
this is all in this b
of theta. b of theta
01:07:21.750 --> 01:07:25.050
contains everything
that's idiosyncratic to
01:07:25.050 --> 01:07:27.270
this particular distribution.
01:07:27.270 --> 01:07:29.000
And so this is going
to be important.
01:07:29.000 --> 01:07:32.120
And we will see that b of theta
is going to capture information
01:07:32.120 --> 01:07:34.230
about the mean,
about the variance,
01:07:34.230 --> 01:07:37.133
about likelihood,
about everything.
01:07:44.710 --> 01:07:46.420
Should I go through
this computation?
01:07:46.420 --> 01:07:47.647
I mean, it's the same.
01:07:47.647 --> 01:07:48.730
We've just done it, right?
01:07:48.730 --> 01:07:53.750
So maybe it's probably better
if you can redo it on your own.
01:07:53.750 --> 01:07:56.680
All right, so the canonical
exponential family also
01:07:56.680 --> 01:07:58.210
has other distributions, right?
01:07:58.210 --> 01:08:00.890
So there's the Gaussian
and there's the Poisson
01:08:00.890 --> 01:08:02.410
and there's the Bernoulli.
01:08:02.410 --> 01:08:05.250
But the other ones may not
be part of this, right?
01:08:05.250 --> 01:08:07.810
In particular, think about
the gamma distribution.
01:08:07.810 --> 01:08:13.600
We had this-- log x was one
of the things that showed up.
01:08:13.600 --> 01:08:15.670
I mean, I cannot get
rid of this log x.
01:08:15.670 --> 01:08:18.729
I mean, that's part of it
except if a is equal to 1
01:08:18.729 --> 01:08:20.380
and I know it for sure, right?
01:08:20.380 --> 01:08:23.979
So if a is equal to 1, then
I'm going to have a minus 1,
01:08:23.979 --> 01:08:25.000
which is equal to 0.
01:08:25.000 --> 01:08:27.160
So I'm going to have
a minus 1 times log
01:08:27.160 --> 01:08:28.630
x, which is going to be just 0.
01:08:28.630 --> 01:08:30.560
So log x is going
to vanish from here.
01:08:30.560 --> 01:08:33.550
But if a is equal to 1,
then this distribution
01:08:33.550 --> 01:08:36.250
is actually much nicer, and
it actually does not even
01:08:36.250 --> 01:08:37.450
deserve the name gamma.
01:08:37.450 --> 01:08:38.890
What is it if a is equal to 1?
01:08:42.444 --> 01:08:43.569
It's an exponential, right?
01:08:43.569 --> 01:08:47.779
Gamma 1 is equal to 1. x to
the a minus 1 is equal to 1.
01:08:47.779 --> 01:08:51.529
b-- so I have exponential of
minus x over b divided by b.
01:08:51.529 --> 01:08:53.520
So 1 over b-- call it lambda.
01:08:53.520 --> 01:08:56.759
And this is just an
exponential distribution.
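A quick numerical sanity check of that claim, as a minimal sketch. It assumes the gamma density is parametrized by a shape a and a scale b, and the function names and the test points are made up for illustration:

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and scale b (assumed parametrization), x > 0."""
    return x ** (a - 1) * math.exp(-x / b) / (math.gamma(a) * b ** a)

def exponential_pdf(x, lam):
    """Exponential density with rate lam, x > 0."""
    return lam * math.exp(-lam * x)

b = 2.0        # arbitrary scale, chosen only for illustration
lam = 1.0 / b  # "1 over b, call it lambda"
points = [0.1, 0.5, 1.0, 3.0, 10.0]

# With a = 1, the log x term is multiplied by a - 1 = 0 and Gamma(1) = 1,
# so the two densities should agree at every point.
max_gap = max(abs(gamma_pdf(x, 1.0, b) - exponential_pdf(x, lam)) for x in points)
print(max_gap)
```

The gap should be zero up to floating-point rounding, which is the "a minus 1 times log x vanishes" argument checked numerically.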
01:08:56.759 --> 01:08:58.800
And so every time you're
going to see something--
01:08:58.800 --> 01:09:02.590
so all these guys that
don't make it to this table,
01:09:02.590 --> 01:09:06.094
they could be part of those
guys, but they're just more--
01:09:06.094 --> 01:09:09.050
they're just to--
01:09:09.050 --> 01:09:10.939
they just have another
name in this thing.
01:09:10.939 --> 01:09:13.970
All right, so you could
compute the value of theta
01:09:13.970 --> 01:09:15.520
for different values, right?
01:09:15.520 --> 01:09:18.714
So again, you still have some
continuous or discrete ones.
01:09:18.714 --> 01:09:19.630
This is my b of theta.
01:09:19.630 --> 01:09:22.520
And I said this is actually
really what captures my theta.
01:09:22.520 --> 01:09:26.450
This b is actually called
cumulant generating function,
01:09:26.450 --> 01:09:27.160
OK?
01:09:27.160 --> 01:09:28.300
I don't have time.
01:09:28.300 --> 01:09:30.370
I could write five
slides to explain to you,
01:09:30.370 --> 01:09:32.729
but it would just only
tell you why it's called
01:09:32.729 --> 01:09:34.390
cumulant generating function.
01:09:34.390 --> 01:09:38.090
It's also known as the log of
the moment generating function.
01:09:38.090 --> 01:09:42.195
And the reason it's called
cumulant generating function
01:09:42.195 --> 01:09:44.320
is because if I start taking
successive derivatives
01:09:44.320 --> 01:09:47.584
and evaluating them at 0, I
get the successive cumulants
01:09:47.584 --> 01:09:50.859
of this distribution, which
are some transformation
01:09:50.859 --> 01:09:51.815
of the moments.
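One way to make the "evaluate at 0" statement precise, sketched under the assumption that the dispersion phi is 1 and that theta + t is still a valid parameter: the log of the moment generating function of Y is a shifted copy of b, so its derivatives at t = 0 are the derivatives of b at theta.

```latex
% Moment generating function of Y under f_\theta(y) = \exp(y\theta - b(\theta) + c(y)):
\mathbb{E}_\theta\!\left[e^{tY}\right]
  = \int e^{ty}\, e^{y\theta - b(\theta) + c(y)}\, dy
  = e^{\,b(\theta+t) - b(\theta)} \int e^{y(\theta+t) - b(\theta+t) + c(y)}\, dy
  = e^{\,b(\theta+t) - b(\theta)}

% Its logarithm, the cumulant generating function in t:
K(t) = \log \mathbb{E}_\theta\!\left[e^{tY}\right] = b(\theta+t) - b(\theta)

% Successive derivatives at t = 0 give the cumulants:
K'(0) = b'(\theta), \qquad K''(0) = b''(\theta), \qquad \dots
```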
01:09:51.815 --> 01:09:53.654
AUDIENCE: What are you
talking about again?
01:09:53.654 --> 01:09:55.070
PHILIPPE RIGOLLET:
The function b.
01:09:55.070 --> 01:09:55.945
AUDIENCE: [INAUDIBLE]
01:09:55.945 --> 01:09:57.986
PHILIPPE RIGOLLET: So this
is just normalization.
01:09:57.986 --> 01:10:00.170
So this is just to tell
you I can compute this,
01:10:00.170 --> 01:10:01.640
but I really don't care.
01:10:01.640 --> 01:10:04.460
And obviously I don't care
about stuff that's complicated.
01:10:04.460 --> 01:10:07.316
This is actually cute, and this
is what completes everything.
01:10:07.316 --> 01:10:09.440
And the rest is just like
some general description.
01:10:09.440 --> 01:10:11.930
I only need to tell
you that the range of y
01:10:11.930 --> 01:10:14.090
is 0 to infinity, right?
01:10:14.090 --> 01:10:16.469
And that is
essentially telling me
01:10:16.469 --> 01:10:19.010
this is going to give me some
hints as to which link function
01:10:19.010 --> 01:10:20.180
I should be using, right?
01:10:20.180 --> 01:10:21.710
Because the range
of y tells me what
01:10:21.710 --> 01:10:23.846
the range of expectation
of y is going to be.
01:10:23.846 --> 01:10:25.970
All right, so here, it
tells me that the range of y
01:10:25.970 --> 01:10:28.850
is between 0 and 1.
01:10:28.850 --> 01:10:30.500
OK, so what I want
to show you is
01:10:30.500 --> 01:10:33.134
that this captures a
variety of different ranges
01:10:33.134 --> 01:10:34.088
that you can have.
01:10:40.300 --> 01:10:46.570
OK, so I'm going to want
to go into the likelihood.
01:10:46.570 --> 01:10:48.460
And the likelihood
I'm actually going
01:10:48.460 --> 01:10:50.780
to use to compute
the expectations.
01:10:50.780 --> 01:10:52.840
But since I actually
don't have time
01:10:52.840 --> 01:10:55.690
to do this now, let's just
go quickly through this
01:10:55.690 --> 01:10:59.770
and give you a spoiler alert to
make sure that you all wake up
01:10:59.770 --> 01:11:01.270
on Thursday and
really, really want
01:11:01.270 --> 01:11:03.160
to think about coming
here immediately.
01:11:03.160 --> 01:11:05.470
All right, so the thing
I'm going to want to do,
01:11:05.470 --> 01:11:07.570
as I said, is it would
be nice if, at least
01:11:07.570 --> 01:11:11.434
for this canonical
family, when I give you b,
01:11:11.434 --> 01:11:12.850
you would be able
to say, oh, here
01:11:12.850 --> 01:11:16.340
is a simple computation of b
that would actually give me
01:11:16.340 --> 01:11:17.530
the mean and the variance.
01:11:17.530 --> 01:11:20.590
The mean and the variance
are also known as moments.
01:11:20.590 --> 01:11:22.970
b is called cumulant
generating function.
01:11:22.970 --> 01:11:24.940
So it sounds like
moments being related
01:11:24.940 --> 01:11:28.060
to cumulants, I might have a
path to finding those, right?
01:11:28.060 --> 01:11:31.660
And it might involve taking
derivatives of b, as we'll see.
01:11:31.660 --> 01:11:33.330
The way we're
going to prove this
01:11:33.330 --> 01:11:36.820
is by using this thing that
we've used several times.
01:11:36.820 --> 01:11:39.354
So this property we use
when we're computing,
01:11:39.354 --> 01:11:41.020
remember, the Fisher
information, right?
01:11:41.020 --> 01:11:43.420
We had two formulas for
the Fisher information.
01:11:43.420 --> 01:11:49.210
One was the expectation of the
second derivative of the log
01:11:49.210 --> 01:11:53.026
likelihood, and one was negative
expectation of the square--
01:11:53.026 --> 01:11:55.150
sorry, expectation of the
square, and the other one
01:11:55.150 --> 01:11:57.970
was negative the expectation of
the second derivative, right?
01:11:57.970 --> 01:12:00.850
The log likelihood is concave,
so this number is negative,
01:12:00.850 --> 01:12:02.470
this number is positive.
01:12:02.470 --> 01:12:04.990
And the way we did this is by
just permuting some derivative
01:12:04.990 --> 01:12:06.004
and integral here.
01:12:06.004 --> 01:12:08.170
And there was just-- we
used the fact that something
01:12:08.170 --> 01:12:09.378
that looked like this, right?
01:12:09.378 --> 01:12:13.780
The log likelihood
is log of f theta.
01:12:13.780 --> 01:12:20.500
And when I take the derivative
of this guy with respect
01:12:20.500 --> 01:12:24.690
to theta, then I
have something that
01:12:24.690 --> 01:12:30.460
looks like the derivative
divided by f theta.
01:12:30.460 --> 01:12:34.020
And if I start taking the
integral against f theta
01:12:34.020 --> 01:12:39.270
of this thing, so the
expectation of this thing,
01:12:39.270 --> 01:12:42.420
those things would cancel.
01:12:42.420 --> 01:12:45.739
And then I had just the
integral of a derivative, which
01:12:45.739 --> 01:12:48.030
I would make a leap of faith
and say that it's actually
01:12:48.030 --> 01:12:49.321
the derivative of the integral.
01:12:53.770 --> 01:12:56.000
But this was equal to 1.
01:12:56.000 --> 01:12:58.404
So this derivative was
actually equal to 0.
01:12:58.404 --> 01:13:00.320
And so that's how you
got that the expectation
01:13:00.320 --> 01:13:02.930
of the derivative of the log
likelihood is equal to 0.
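Written out, the argument just described might look like the following sketch. It assumes a density f_theta with respect to dy and takes the exchange of derivative and integral for granted, which is the leap of faith mentioned above:

```latex
% Derivative of the log-likelihood:
\frac{\partial}{\partial\theta} \log f_\theta(y)
  = \frac{\frac{\partial}{\partial\theta} f_\theta(y)}{f_\theta(y)}

% Taking the expectation, f_\theta cancels:
\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta} \log f_\theta(Y)\right]
  = \int \frac{\frac{\partial}{\partial\theta} f_\theta(y)}{f_\theta(y)}\, f_\theta(y)\, dy
  = \int \frac{\partial}{\partial\theta} f_\theta(y)\, dy
  = \frac{\partial}{\partial\theta} \int f_\theta(y)\, dy
  = \frac{\partial}{\partial\theta}\, 1
  = 0
```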
01:13:02.930 --> 01:13:04.940
And you do it once again
and you get this guy.
01:13:04.940 --> 01:13:06.350
It's just some nice
things that happen
01:13:06.350 --> 01:13:08.433
with the [INAUDIBLE] taking
derivative of the log.
01:13:08.433 --> 01:13:10.430
We've done that,
we'll do that again.
01:13:10.430 --> 01:13:13.660
But once you do this, you
can actually apply it.
01:13:13.660 --> 01:13:17.580
And-- missing a
parenthesis over there.
01:13:17.580 --> 01:13:19.610
So when you write
the log likelihood,
01:13:19.610 --> 01:13:21.266
it's just log of an exponential.
01:13:21.266 --> 01:13:22.640
Huh, that's actually
pretty nice.
01:13:22.640 --> 01:13:25.020
Just like the least squares
came naturally, the least
01:13:25.020 --> 01:13:26.436
squares [INAUDIBLE]
came naturally
01:13:26.436 --> 01:13:29.204
when we took the log
likelihood of the Gaussians,
01:13:29.204 --> 01:13:31.370
we're going to have the
same thing that happens when
01:13:31.370 --> 01:13:33.080
I take the log of the density.
01:13:33.080 --> 01:13:35.300
The exponential is
going to go away,
01:13:35.300 --> 01:13:36.990
and then I'm going
to use this formula.
01:13:36.990 --> 01:13:39.770
But this formula is
going to actually give me
01:13:39.770 --> 01:13:43.026
an equation directly--
oh, that's where it was.
01:13:43.026 --> 01:13:44.840
So that's the one
that's missing up there.
01:13:44.840 --> 01:13:49.010
And so the expectation
minus this thing
01:13:49.010 --> 01:13:50.780
is going to be equal
to 0, which tells me
01:13:50.780 --> 01:13:53.122
that the expectation
is just the derivative.
01:13:53.122 --> 01:13:55.190
Right, so it's still
a function of theta,
01:13:55.190 --> 01:13:57.410
but it's just a derivative of b.
01:13:57.410 --> 01:13:59.660
And the variance
is just going to be
01:13:59.660 --> 01:14:01.280
the second derivative of b.
01:14:01.280 --> 01:14:03.910
But remember, this was some
sort of a scaling, right?
01:14:03.910 --> 01:14:05.990
It's called the
dispersion parameter.
01:14:05.990 --> 01:14:09.410
So if I had a Gaussian and
the variance of the Gaussian
01:14:09.410 --> 01:14:12.020
did not depend on
the sigma squared
01:14:12.020 --> 01:14:15.260
which I stuffed in this phi,
that would be certainly weird.
01:14:15.260 --> 01:14:17.601
And it cannot depend only
on mu, and so this will--
01:14:17.601 --> 01:14:19.850
for the Gaussian, this is
definitely going to be equal
01:14:19.850 --> 01:14:20.960
to 1.
01:14:20.960 --> 01:14:24.350
And this is just going to
be equal to my variance.
01:14:24.350 --> 01:14:28.460
So this is just by taking
the second derivative.
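A sketch of the two computations, assuming the canonical density is written as exp((y theta - b(theta))/phi + c(y, phi)), where c is an assumed name for the part that depends only on y and phi:

```latex
% Log-likelihood of one observation:
\ell(\theta) = \frac{Y\theta - b(\theta)}{\phi} + c(Y,\phi)

% First identity: the expected derivative of the log-likelihood is zero.
\frac{\partial \ell}{\partial\theta} = \frac{Y - b'(\theta)}{\phi},
\qquad
\mathbb{E}_\theta\!\left[\frac{\partial \ell}{\partial\theta}\right] = 0
\;\Longrightarrow\;
\mathbb{E}_\theta[Y] = b'(\theta)

% Second identity: the two formulas for the Fisher information agree.
\frac{\partial^2 \ell}{\partial\theta^2} = -\frac{b''(\theta)}{\phi},
\qquad
\mathbb{E}_\theta\!\left[\left(\frac{\partial \ell}{\partial\theta}\right)^{2}\right]
  = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2 \ell}{\partial\theta^2}\right]
\;\Longrightarrow\;
\frac{\operatorname{Var}_\theta(Y)}{\phi^{2}} = \frac{b''(\theta)}{\phi}
\;\Longrightarrow\;
\operatorname{Var}_\theta(Y) = \phi\, b''(\theta)

% Gaussian check: b(\theta) = \theta^2/2 gives b''(\theta) = 1, and \phi = \sigma^2,
% so the variance is \sigma^2.
```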
01:14:28.460 --> 01:14:33.260
So basically, the take-home
message is that this function b
01:14:33.260 --> 01:14:35.170
captures--
01:14:35.170 --> 01:14:37.090
by taking one derivative,
the expectation,
01:14:37.090 --> 01:14:39.565
and by taking two derivatives
captures the variance.
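To see the take-home message on a concrete case, here is a minimal numerical sketch for the Poisson family. It assumes theta = log(lambda), b(theta) = exp(theta), and dispersion phi = 1; the value of lambda, the step size h, and the helper names are arbitrary choices for illustration:

```python
import math

def b(theta):
    """Poisson cumulant generating function in canonical form: b(theta) = exp(theta)."""
    return math.exp(theta)

def poisson_moments(lam, n_terms=200):
    """Mean and variance computed directly from the Poisson pmf (truncated sum)."""
    pmf = [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
           for k in range(n_terms)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var

lam = 3.5              # illustrative Poisson parameter
theta = math.log(lam)  # canonical parameter
h = 1e-4               # finite-difference step

# Central finite differences approximating b'(theta) and b''(theta).
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)
b_second = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2

mean, var = poisson_moments(lam)
print(b_prime, mean)   # both should be close to lam
print(b_second, var)   # both should be close to lam (phi = 1 here)
```

For a family with a dispersion parameter other than 1, such as the Gaussian, the second derivative would have to be multiplied by phi before comparing it with the variance.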
01:14:39.565 --> 01:14:41.200
Another thing
that's actually cool
01:14:41.200 --> 01:14:42.580
and we'll come
back to this and I
01:14:42.580 --> 01:14:45.640
want to think about is if
this second derivative is
01:14:45.640 --> 01:14:49.190
the variance, what can
I say about this thing?
01:14:52.037 --> 01:14:53.370
What do I know about a variance?
01:14:53.370 --> 01:14:54.950
AUDIENCE: [INAUDIBLE]
01:14:54.950 --> 01:14:56.730
PHILIPPE RIGOLLET:
Yeah, that's positive.
01:14:56.730 --> 01:14:58.110
So I know that this is positive.
01:14:58.110 --> 01:15:00.600
So what does that tell me?
01:15:00.600 --> 01:15:03.115
Positive?
01:15:03.115 --> 01:15:03.990
That's convex, right?
01:15:03.990 --> 01:15:07.050
A function that has positive
second derivative is convex.
01:15:07.050 --> 01:15:09.700
So we're going to use
that as well, all right?
01:15:09.700 --> 01:15:12.530
So yeah, I'll see
you on Thursday.
01:15:12.530 --> 01:15:14.380
I have your homework.