WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.610
at ocw.mit.edu.
00:00:20.340 --> 00:00:23.280
PHILIPPE RIGOLLET:
We're talking about
00:00:23.280 --> 00:00:24.390
generalized linear models.
00:00:24.390 --> 00:00:25.764
And in generalized
linear models,
00:00:25.764 --> 00:00:28.780
we generalize linear
models in two ways.
00:00:28.780 --> 00:00:31.170
The first one is to allow
for a different distribution
00:00:31.170 --> 00:00:32.910
for the response variables.
00:00:32.910 --> 00:00:34.560
And the distributions
that we wanted
00:00:34.560 --> 00:00:37.230
was the exponential family.
00:00:44.240 --> 00:00:46.910
And this is a family
that can be generalized
00:00:46.910 --> 00:00:49.340
over random variables
that are defined
00:00:49.340 --> 00:00:52.640
on R^q in general,
with parameters in R^k.
00:00:52.640 --> 00:00:58.310
But we're going to focus on
a very specific case when
00:00:58.310 --> 00:01:00.710
y is a real valued
response variable, which
00:01:00.710 --> 00:01:04.040
is the one you're used to when
you're doing linear regression.
00:01:04.040 --> 00:01:09.500
And the parameter
theta also lives in r.
00:01:09.500 --> 00:01:12.250
And so we're going to talk
about the canonical case.
00:01:12.250 --> 00:01:15.920
So that's the canonical
exponential family,
00:01:15.920 --> 00:01:19.760
where you have a density,
f theta of x, which is
00:01:19.760 --> 00:01:25.280
of the form exponential of--
00:01:25.280 --> 00:01:27.800
And then, we have y,
which interacts with theta
00:01:27.800 --> 00:01:29.580
only by taking a product.
00:01:29.580 --> 00:01:32.990
Then, there's a term, b of theta,
that depends only on theta,
00:01:32.990 --> 00:01:35.180
divided by some dispersion parameter phi.
00:01:35.180 --> 00:01:37.340
And then, we have some
normalization factor.
00:01:37.340 --> 00:01:44.900
Let's call it c of y phi.
00:01:44.900 --> 00:01:48.340
So it really should not matter
too much, so it's c of y phi,
00:01:48.340 --> 00:01:50.520
and that's really just the
normalization factor.
00:01:50.520 --> 00:01:54.010
And here, we're going to
assume that phi is known.
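NOTE The density just described can be written out as follows. This is a sketch reconstructed from the spoken description, using the lecture's own symbols: b for the term that depends only on theta, phi for the dispersion parameter, and c for the normalization term.

```latex
\[
f_\theta(y) = \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y, \phi) \right),
\qquad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \phi \text{ known}.
\]
```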
00:01:57.480 --> 00:01:58.694
I have no idea what I write.
00:01:58.694 --> 00:02:00.110
I don't know if
you guys can read.
00:02:00.110 --> 00:02:01.943
I don't know what chalk
has been used today,
00:02:01.943 --> 00:02:05.480
but I just can't see it.
00:02:05.480 --> 00:02:08.694
That's not my fault. All
right, so we're going
00:02:08.694 --> 00:02:09.860
to assume that phi is known.
00:02:09.860 --> 00:02:12.021
And so we saw that
several distributions
00:02:12.021 --> 00:02:14.270
that we know well, including
the Gaussian for example,
00:02:14.270 --> 00:02:15.612
belong to this family.
00:02:15.612 --> 00:02:17.320
And there's other
ones, such as Poisson--
00:02:21.084 --> 00:02:22.000
Poisson and Bernoulli.
00:02:22.000 --> 00:02:24.040
So if the PMF has
this form, if you
00:02:24.040 --> 00:02:27.610
have a discrete random
variable, this is also valid.
00:02:27.610 --> 00:02:29.779
And the reason why we
introduced this family
00:02:29.779 --> 00:02:32.320
is because there are going to
be some properties that we know
00:02:32.320 --> 00:02:34.960
that this thing here,
this function, b of theta,
00:02:34.960 --> 00:02:37.540
is essentially what
completely characterizes
00:02:37.540 --> 00:02:38.950
your distribution.
00:02:38.950 --> 00:02:42.504
So if phi is fixed, we know that
the interaction is of this form.
00:02:42.504 --> 00:02:44.170
And this really just
comes from the fact
00:02:44.170 --> 00:02:46.490
that we want the function
to integrate to one.
00:02:46.490 --> 00:02:49.180
So this b here in
the canonical form
00:02:49.180 --> 00:02:50.860
encodes everything
we want to know.
00:02:50.860 --> 00:02:53.164
If I tell you what
b of theta is--
00:02:53.164 --> 00:02:54.580
and of course, I
tell you what phi
00:02:54.580 --> 00:02:56.914
is, but let's say for a second
that phi is equal to one.
00:02:56.914 --> 00:02:58.330
If I tell you this
b of theta, you
00:02:58.330 --> 00:03:00.420
know exactly what distribution
I'm talking about.
00:03:00.420 --> 00:03:02.520
So it should encode
everything that's
00:03:02.520 --> 00:03:05.920
specific to this distribution,
such as mean, variance,
00:03:05.920 --> 00:03:07.780
all the moments
that you would want.
00:03:07.780 --> 00:03:12.310
And we'll see how we can
compute from this thing
00:03:12.310 --> 00:03:14.590
the mean and the
variance, for example.
00:03:14.590 --> 00:03:16.567
So today, we're going to
talk about likelihood,
00:03:16.567 --> 00:03:18.400
and we're going to start
with the likelihood
00:03:18.400 --> 00:03:21.100
function or the log likelihood
for one observation.
00:03:21.100 --> 00:03:23.440
From this, we're going
to do some computations,
00:03:23.440 --> 00:03:26.590
and then, we'll move on to the
actual log likelihood based
00:03:26.590 --> 00:03:28.750
on n independent observations.
00:03:28.750 --> 00:03:30.730
And here, as we will
see, the observations
00:03:30.730 --> 00:03:32.950
are not going to be
identically distributed,
00:03:32.950 --> 00:03:35.080
because we're going
to want each of them,
00:03:35.080 --> 00:03:39.530
conditionally on x to be a
different function of x, where
00:03:39.530 --> 00:03:41.200
theta is just a
different function of x
00:03:41.200 --> 00:03:43.230
for each of the observations.
00:03:43.230 --> 00:03:45.649
So remember, the
log likelihood--
00:03:50.050 --> 00:03:52.630
and this is for
one observation--
00:03:52.630 --> 00:03:54.400
is just the log of
the density, right?
00:03:59.090 --> 00:04:02.840
And we have this
identity that I mentioned
00:04:02.840 --> 00:04:04.530
at the end of the
class on Tuesday.
00:04:04.530 --> 00:04:06.800
And this identity is
just that the expectation
00:04:06.800 --> 00:04:08.960
of the derivative of this
guy with respect to theta
00:04:08.960 --> 00:04:10.430
is equal to 0.
00:04:10.430 --> 00:04:11.330
So let's see why.
00:04:11.330 --> 00:04:15.610
So if I take the derivative
with respect to theta, of log f,
00:04:15.610 --> 00:04:18.860
theta of x, what I
get is the derivative
00:04:18.860 --> 00:04:21.930
with respect to
theta of f, theta
00:04:21.930 --> 00:04:26.880
of x, divided by f theta of x.
00:04:26.880 --> 00:04:30.820
Now, if I take the
expectation of this guy,
00:04:30.820 --> 00:04:37.810
with respect to this
theta as well, what I get
00:04:37.810 --> 00:04:40.210
is that this thing--
what is the expectation?
00:04:40.210 --> 00:04:42.970
Well, it's just the
integral against f theta.
00:04:42.970 --> 00:04:45.040
Or if I'm in a
discrete case, I just
00:04:45.040 --> 00:04:48.590
have the sum against f
theta, if it's a pmf.
00:04:48.590 --> 00:04:53.320
Just the definition,
the expectation of x,
00:04:53.320 --> 00:04:56.790
is either the integral--
well, let's say of h of x--
00:04:56.790 --> 00:04:59.770
is integral of h of x.
00:04:59.770 --> 00:05:01.390
f theta of x, dx--
00:05:01.390 --> 00:05:04.090
if this is continuous,
or is just the sum
00:05:04.090 --> 00:05:07.960
of h of x, f theta of x.
00:05:07.960 --> 00:05:09.880
If x is discrete--
00:05:09.880 --> 00:05:15.115
so if it's continuous,
you put this soft sum.
00:05:15.115 --> 00:05:17.110
This guy is the
same thing, right?
00:05:17.110 --> 00:05:20.450
So I'm just going to illustrate
the case when it's continuous.
00:05:20.450 --> 00:05:21.300
So this is what?
00:05:21.300 --> 00:05:24.790
Well, this is the integral of
partial derivative with respect
00:05:24.790 --> 00:05:29.740
to theta, of f theta of
x, divided by f theta
00:05:29.740 --> 00:05:35.060
of x, all times f theta of x--
00:05:35.060 --> 00:05:36.526
dx.
00:05:36.526 --> 00:05:38.627
And now, this f
theta is canceled,
00:05:38.627 --> 00:05:40.210
so I'm actually left
with the integral
00:05:40.210 --> 00:05:41.290
of the derivative,
which I'm going
00:05:41.290 --> 00:05:43.081
to write as the derivative
of the integral.
00:05:50.690 --> 00:06:01.640
But f theta being a density
for any value of theta
00:06:01.640 --> 00:06:04.150
that I can take,
this is the function.
00:06:04.150 --> 00:06:07.720
As a function of
theta, this function
00:06:07.720 --> 00:06:10.670
is constantly equal to 1.
00:06:10.670 --> 00:06:13.710
For any theta that I take,
it takes the value 1.
00:06:13.710 --> 00:06:16.344
So this is constantly
equal to 1.
00:06:16.344 --> 00:06:18.510
I put three bars to say
that for any value of theta,
00:06:18.510 --> 00:06:21.830
this is 1, which actually
tells me that the derivative is
00:06:21.830 --> 00:06:24.200
equal to 0.
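NOTE The board computation for this first identity can be summarized in one chain of equalities. This is a sketch for the continuous case, assuming (as the lecture does implicitly) that the derivative with respect to theta can be exchanged with the integral.

```latex
\[
\mathbb{E}_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(X) \right]
= \int \frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx
= \int \partial_\theta f_\theta(x)\,dx
= \frac{\partial}{\partial\theta} \int f_\theta(x)\,dx
= \frac{\partial}{\partial\theta}\, 1
= 0 .
\]
```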
00:06:24.200 --> 00:06:25.010
OK, yes?
00:06:30.455 --> 00:06:32.930
AUDIENCE: What is
the first [INAUDIBLE]
00:06:32.930 --> 00:06:34.415
that you wrote on the board?
00:06:38.666 --> 00:06:40.540
PHILIPPE RIGOLLET: That's
just the definition
00:06:40.540 --> 00:06:44.396
of the derivative of
the log of a function?
00:06:44.396 --> 00:06:45.364
AUDIENCE: OK.
00:06:49.720 --> 00:06:53.660
PHILIPPE RIGOLLET: The derivative of
log of f is f prime over f.
00:06:53.660 --> 00:06:56.060
That's a log, yeah.
00:06:56.060 --> 00:06:59.735
Just by elimination.
00:06:59.735 --> 00:07:01.652
AUDIENCE: [INAUDIBLE]
00:07:01.652 --> 00:07:02.860
PHILIPPE RIGOLLET: I'm sorry.
00:07:02.860 --> 00:07:05.276
AUDIENCE: When you write a
squiggle that starts with an l,
00:07:05.276 --> 00:07:06.680
I assume it's lambda.
00:07:06.680 --> 00:07:09.420
PHILIPPE RIGOLLET: And you do
good, because that's probably
00:07:09.420 --> 00:07:11.370
how my mind processes.
00:07:11.370 --> 00:07:13.310
And so I'm like, yeah, l.
00:07:13.310 --> 00:07:16.820
Here is enough information.
00:07:16.820 --> 00:07:19.290
OK, everybody is good with this?
00:07:19.290 --> 00:07:21.260
So that was convenient.
00:07:21.260 --> 00:07:22.860
So it just said
that the expectation
00:07:22.860 --> 00:07:26.970
of the derivative of the log
likelihood is equal to 0.
00:07:26.970 --> 00:07:29.340
That's going to be
our first identity.
00:07:29.340 --> 00:07:30.900
Let's move onto the
second identity,
00:07:30.900 --> 00:07:34.140
using exactly the same
trick, which is let's hope
00:07:34.140 --> 00:07:35.850
that at some point,
we have the integral
00:07:35.850 --> 00:07:37.266
of this function
that's constantly
00:07:37.266 --> 00:07:41.550
equal to 1 as a function of
theta, and then use the fact
00:07:41.550 --> 00:07:43.650
that its derivative
is equal to 0.
00:07:43.650 --> 00:07:54.850
So if I start taking the
second derivative of the log
00:07:54.850 --> 00:07:57.470
of f theta, so what is this?
00:07:57.470 --> 00:08:00.600
Well, it's the
derivative of this guy
00:08:00.600 --> 00:08:02.720
here, so I'm going
to go straight to it.
00:08:02.720 --> 00:08:08.830
So it's second derivative,
f theta of x, times
00:08:08.830 --> 00:08:19.810
f theta of x, minus first
derivative of f theta of x,
00:08:19.810 --> 00:08:22.160
times first derivative
of f theta of x.
00:08:26.360 --> 00:08:29.670
Here is some super
important stuff--
00:08:29.670 --> 00:08:31.740
no, I'm kidding.
00:08:31.740 --> 00:08:34.080
So you can still see
that guy over there?
00:08:34.080 --> 00:08:35.870
So it's just the square.
00:08:35.870 --> 00:08:38.100
And then, I divide by
f theta of x squared.
00:08:43.890 --> 00:08:48.340
So here I have the second
derivative, times f itself.
00:08:48.340 --> 00:08:51.630
And here, I have the product
of the first derivative
00:08:51.630 --> 00:08:52.160
with itself.
00:08:52.160 --> 00:08:53.544
So that's the square.
00:08:53.544 --> 00:08:55.210
So now, I'm going to
integrate this guy.
00:08:55.210 --> 00:09:01.550
So if I take the expectation
of this thing here, what I get
00:09:01.550 --> 00:09:03.741
is the integral.
00:09:03.741 --> 00:09:05.240
So here, the only
thing that's going
00:09:05.240 --> 00:09:07.073
to happen when I'm going
to take my integral
00:09:07.073 --> 00:09:09.380
is that one of those
squares is going to cancel
00:09:09.380 --> 00:09:10.940
against f theta, right?
00:09:10.940 --> 00:09:22.430
So I'm going to get
the second derivative
00:09:22.430 --> 00:09:24.830
minus the first
derivative squared.
00:09:32.950 --> 00:09:34.560
And then, I'm
divided by f theta.
00:09:38.120 --> 00:09:39.850
And I know that this
thing is equal to 0.
00:09:44.460 --> 00:09:46.660
Now, one of these guys here--
00:09:46.660 --> 00:09:48.180
sorry, why do I have--
00:09:48.180 --> 00:09:49.350
so I have this guy here.
00:09:49.350 --> 00:09:50.865
So this guy here
is going to cancel.
00:09:53.440 --> 00:09:57.320
So this is what
this is equal to--
00:09:59.970 --> 00:10:05.595
the integral of the partial,
so the second derivative of f
00:10:05.595 --> 00:10:09.620
theta of x, because
those two guys cancel--
00:10:09.620 --> 00:10:14.156
minus the integral of
the second derivative.
00:10:24.380 --> 00:10:28.360
And this is telling me what?
00:10:55.480 --> 00:10:58.040
Yeah, I'm losing one, because
I have some weird sequences.
00:10:58.040 --> 00:10:58.539
Thank you.
00:11:03.210 --> 00:11:12.282
OK, this is still positive.
00:11:12.282 --> 00:11:14.490
I want to say that this
thing is actually equal to 0.
00:11:17.040 --> 00:11:19.410
But then, it gives
me some weird things,
00:11:19.410 --> 00:11:24.150
which are that I
have an integral
00:11:24.150 --> 00:11:26.080
of a positive function,
which is equal to 0.
00:11:32.814 --> 00:11:34.480
Yeah, that's what I'm
thinking of doing.
00:11:34.480 --> 00:11:37.230
But I'm going to get 0 for
this entire integral, which
00:11:37.230 --> 00:11:39.729
means that I have the integral
of a positive function, which
00:11:39.729 --> 00:11:42.810
is equal to 0, which means that
this function is equal to 0,
00:11:42.810 --> 00:11:44.940
which sounds a little bad--
00:11:44.940 --> 00:11:48.310
basically, tells me that this
function, f theta, is linear.
00:11:52.710 --> 00:11:55.259
So I went a little
too far, I believe,
00:11:55.259 --> 00:11:57.300
because I only want to
prove that the expectation
00:11:57.300 --> 00:11:58.937
of the second derivative--
00:12:24.190 --> 00:12:25.970
Yes, so I want to pull this out.
00:12:31.020 --> 00:12:35.229
So let's see, if I keep rolling
with this, I'm going to get--
00:12:35.229 --> 00:12:37.520
well, no because the fact
that it's divided by f theta,
00:12:37.520 --> 00:12:40.670
means that, indeed, the second
derivative is equal to 0.
00:12:40.670 --> 00:12:41.960
So it cannot do this here.
00:12:49.446 --> 00:12:51.438
AUDIENCE: [INAUDIBLE]
00:12:59.920 --> 00:13:03.120
PHILIPPE RIGOLLET: OK, but
let's write it like this.
00:13:03.120 --> 00:13:05.020
You're right, so this is what?
00:13:05.020 --> 00:13:12.200
This is the expectation of
the partial with respect
00:13:12.200 --> 00:13:15.740
to theta of f
theta of x, divided
00:13:15.740 --> 00:13:21.360
by f theta of x, all squared.
00:13:21.360 --> 00:13:25.530
And this is exactly the
derivative of the log, right?
00:13:25.530 --> 00:13:28.830
So indeed, this thing is equal
to the expectation with respect
00:13:28.830 --> 00:13:34.980
to theta of the partial of l--
00:13:34.980 --> 00:13:41.160
of log f theta, divided
by partial theta.
00:13:41.160 --> 00:13:44.660
All right, so this is one of
the guys that I want squared.
00:13:44.660 --> 00:13:47.510
This is one of the
guys that I want.
00:13:47.510 --> 00:13:50.810
And this is actually
equal, so this will
00:13:50.810 --> 00:13:53.270
be equal to the expectation--
00:13:56.186 --> 00:13:58.130
AUDIENCE: [INAUDIBLE]
00:13:59.110 --> 00:14:02.860
PHILIPPE RIGOLLET: Oh, right, so
this term should be equal to 0.
00:14:02.860 --> 00:14:03.940
This was not 0.
00:14:03.940 --> 00:14:04.990
You're absolutely right.
00:14:04.990 --> 00:14:06.672
So at some point,
I got confused,
00:14:06.672 --> 00:14:08.380
because I thought
putting this equal to 0
00:14:08.380 --> 00:14:09.463
would mean that this is 0.
00:14:09.463 --> 00:14:10.840
But this thing is
not equal to 0.
00:14:10.840 --> 00:14:11.650
So this thing, you're right.
00:14:11.650 --> 00:14:13.858
I take the same trick as
before, and this is actually
00:14:13.858 --> 00:14:16.900
equal to 0, which
means that now I have
00:14:16.900 --> 00:14:19.690
what's on the left-hand side,
which is equal to what's
00:14:19.690 --> 00:14:20.720
on the right-hand side.
00:14:20.720 --> 00:14:23.220
And if I recap, I
get that e theta
00:14:23.220 --> 00:14:30.750
of the second derivative
of the log of f theta
00:14:30.750 --> 00:14:32.670
is equal to minus--
00:14:32.670 --> 00:14:34.360
because I had a
minus sign here--
00:14:34.360 --> 00:14:39.300
to the expectation with respect
to theta, of partial of log of f theta,
00:14:39.300 --> 00:14:44.100
divided by partial theta, squared.
00:14:44.100 --> 00:14:48.720
Thank you for being on watch
when I'm falling apart.
00:14:48.720 --> 00:14:50.390
All right, so this
is exactly what
00:14:50.390 --> 00:14:52.140
you have here, except
that both terms have
00:14:52.140 --> 00:14:54.180
been put on the same side.
00:14:54.180 --> 00:14:57.550
All right, so those things
are going to be useful to us,
00:14:57.550 --> 00:14:59.880
so maybe, we should write
them somewhere here.
00:15:13.820 --> 00:15:16.210
And then, we have
that the expectation
00:15:16.210 --> 00:15:26.090
of the second
derivative of the log
00:15:26.090 --> 00:15:33.170
is equal to minus the
expectation of the square
00:15:33.170 --> 00:15:34.941
of the first derivative.
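NOTE The second identity can be written out cleanly as follows. This is a sketch of the corrected board computation, again assuming that derivatives in theta can be exchanged with the integral; the partial-derivative subscript notation is an editorial shorthand.

```latex
\[
\frac{\partial^2}{\partial\theta^2} \log f_\theta(x)
= \frac{\partial_\theta^2 f_\theta(x)}{f_\theta(x)}
- \left( \frac{\partial_\theta f_\theta(x)}{f_\theta(x)} \right)^{2} .
\]
\[
\mathbb{E}_\theta\!\left[ \frac{\partial_\theta^2 f_\theta(X)}{f_\theta(X)} \right]
= \int \partial_\theta^2 f_\theta(x)\,dx
= \frac{\partial^2}{\partial\theta^2} \int f_\theta(x)\,dx
= 0
\quad\Longrightarrow\quad
\mathbb{E}_\theta\!\left[ \frac{\partial^2}{\partial\theta^2} \log f_\theta(X) \right]
= - \,\mathbb{E}_\theta\!\left[ \left( \frac{\partial}{\partial\theta} \log f_\theta(X) \right)^{2} \right] .
\]
```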
00:15:40.020 --> 00:15:42.750
And this is, indeed,
my Fisher information.
00:15:42.750 --> 00:15:48.030
This is just telling me what is
the second derivative of my log
00:15:48.030 --> 00:15:49.217
likelihood at theta, right?
00:15:49.217 --> 00:15:50.800
So everything is
with respect to theta
00:15:50.800 --> 00:15:52.440
when I take these expectations.
00:15:52.440 --> 00:15:55.140
And so it tells me
that the expectation
00:15:55.140 --> 00:15:58.530
of the second derivative--
at least first of all, what
00:15:58.530 --> 00:16:00.150
it's telling me is
that it's concave,
00:16:00.150 --> 00:16:02.970
because the second
derivative of this thing,
00:16:02.970 --> 00:16:05.910
which is the second
derivative of KL divergence,
00:16:05.910 --> 00:16:09.436
is actually minus something
which must be non-negative.
00:16:09.436 --> 00:16:11.310
And so it's telling me
that it's concave here
00:16:11.310 --> 00:16:14.646
at this [INAUDIBLE].
00:16:14.646 --> 00:16:16.270
And in particular,
it's also telling me
00:16:16.270 --> 00:16:19.240
that it has to be strictly
positive, unless the derivative
00:16:19.240 --> 00:16:21.040
of f is equal to 0.
00:16:21.040 --> 00:16:27.700
Unless f is constant, then
it's not going to change.
00:16:27.700 --> 00:16:32.920
All right, do you
have a question?
00:16:32.920 --> 00:16:35.660
So now, let's use this.
00:16:35.660 --> 00:16:37.660
So what does my
log likelihood look
00:16:37.660 --> 00:16:41.390
like when I actually compute it
for this canonical exponential
00:16:41.390 --> 00:16:42.760
family.
00:16:42.760 --> 00:16:45.310
We have this exponential
function, so taking the log
00:16:45.310 --> 00:16:48.260
should make my life much
easier, and indeed, it does.
00:16:48.260 --> 00:16:56.250
So if I look at the
canonical, what I have
00:16:56.250 --> 00:17:04.339
is that the log of f
theta of x, it's equal
00:17:04.339 --> 00:17:10.849
simply to y theta minus b
of theta, divided by phi,
00:17:10.849 --> 00:17:14.880
plus this function that
does not depend on theta.
00:17:18.790 --> 00:17:20.440
So let's see what this tells me.
00:17:20.440 --> 00:17:23.329
Let's just plug in those
equalities in there.
00:17:23.329 --> 00:17:25.329
I can take the derivative
of the right-hand side
00:17:25.329 --> 00:17:28.060
and just say that in
expectation, it's equal to 0.
00:17:28.060 --> 00:17:32.300
So if I start looking
at the derivative,
00:17:32.300 --> 00:17:33.780
this is equal to what?
00:17:33.780 --> 00:17:37.820
Well, here I'm going
to pick up only y.
00:17:37.820 --> 00:17:39.160
Sorry, this is a function of y.
00:17:46.585 --> 00:17:48.460
I was talking about
likelihood, so I actually
00:17:48.460 --> 00:17:50.630
need to put the
random variable here.
00:17:50.630 --> 00:17:54.750
So I get y minus the
derivative of b of theta.
00:17:54.750 --> 00:17:56.250
Since it's only a
function of theta,
00:17:56.250 --> 00:17:58.180
I'm just going to write
b prime, is that OK--
00:17:58.180 --> 00:18:01.450
rather than having the
partial with respect to theta.
00:18:01.450 --> 00:18:02.932
Then, this is a constant.
00:18:02.932 --> 00:18:04.890
This does not depend on
theta, so it goes away.
00:18:10.200 --> 00:18:15.970
So if I start taking the
expectation of this guy,
00:18:15.970 --> 00:18:20.200
I get the expectation
of this guy,
00:18:20.200 --> 00:18:24.960
which is the expectation
of y, minus-- well,
00:18:24.960 --> 00:18:27.100
this does not depend on
y, so it's just itself--
00:18:27.100 --> 00:18:28.390
b prime of theta.
00:18:28.390 --> 00:18:30.460
And the whole thing
is divided by phi.
00:18:30.460 --> 00:18:33.100
But from my first
equality over there,
00:18:33.100 --> 00:18:35.020
I know that this thing
is actually equal to 0.
00:18:38.680 --> 00:18:40.740
We just proved that.
00:18:40.740 --> 00:18:43.420
So in particular, it means
that since phi is non-zero,
00:18:43.420 --> 00:18:45.550
it means that this guy
must be equal to this guy--
00:18:45.550 --> 00:18:47.500
or phi is not infinity.
00:18:47.500 --> 00:18:50.395
And so that implies
that the expectation
00:18:50.395 --> 00:18:56.310
with respect to theta of y
is equal to b prime of theta.
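NOTE The steps leading to the mean can be collected in two lines. This is a sketch using the canonical form with known phi, where the term c(y, phi) drops out because it does not depend on theta.

```latex
\[
\frac{\partial}{\partial\theta} \log f_\theta(Y) = \frac{Y - b'(\theta)}{\phi},
\qquad
0 = \mathbb{E}_\theta\!\left[ \frac{Y - b'(\theta)}{\phi} \right]
= \frac{\mathbb{E}_\theta[Y] - b'(\theta)}{\phi}
\quad\Longrightarrow\quad
\mathbb{E}_\theta[Y] = b'(\theta) .
\]
```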
00:19:02.322 --> 00:19:04.280
I'm sorry, you're not
registered in this class.
00:19:04.280 --> 00:19:07.230
I'm going to have
to ask you to leave.
00:19:07.230 --> 00:19:09.150
I'm not kidding.
00:19:09.150 --> 00:19:10.570
AUDIENCE: [INAUDIBLE]
00:19:11.070 --> 00:19:12.520
PHILIPPE RIGOLLET: You are?
00:19:12.520 --> 00:19:13.970
I've never seen you here.
00:19:13.970 --> 00:19:15.861
I saw you for the first lecture.
00:19:15.861 --> 00:19:16.360
OK.
00:19:19.120 --> 00:19:23.105
All right, so e theta of y
is equal to b prime of theta.
00:19:23.105 --> 00:19:24.230
Everybody agrees with that?
00:19:27.320 --> 00:19:31.130
So this is actually nice,
because if I give you
00:19:31.130 --> 00:19:34.190
an exponential family, the only
thing I really need to tell
00:19:34.190 --> 00:19:36.210
you is what b of theta is.
00:19:36.210 --> 00:19:38.780
And if I give you b of theta,
then computing a derivative
00:19:38.780 --> 00:19:42.470
is actually much easier
than having to integrate y
00:19:42.470 --> 00:19:44.000
against the density itself.
00:19:44.000 --> 00:19:46.010
You could really
have fun and try
00:19:46.010 --> 00:19:48.310
to compute this, which you
would be able to do, right?
00:19:54.080 --> 00:19:58.840
And then, there's the plus c
of y phi, blah, blah, blah--
00:19:58.840 --> 00:19:59.385
dy.
00:19:59.385 --> 00:20:01.760
And that's the way you would
actually compute this thing.
00:20:05.040 --> 00:20:06.680
Sorry, this guy is here.
00:20:06.680 --> 00:20:07.940
That would be painful.
00:20:07.940 --> 00:20:10.310
I don't know what this
normalization looks like,
00:20:10.310 --> 00:20:12.230
so I would have to
also make that explicit,
00:20:12.230 --> 00:20:13.790
so I can actually
compute this thing.
00:20:13.790 --> 00:20:15.415
And you know, just
the same way, if you
00:20:15.415 --> 00:20:17.852
want to compute the
expectation of a Gaussian--
00:20:17.852 --> 00:20:19.310
well, the expectation
of a Gaussian
00:20:19.310 --> 00:20:20.750
is not the most difficult one.
00:20:20.750 --> 00:20:23.380
But even if you compute the
expectation of a Poisson,
00:20:23.380 --> 00:20:25.130
you start to have to
work a little bit.
00:20:25.130 --> 00:20:27.200
There's a few things that
you have to work through.
00:20:27.200 --> 00:20:29.199
Here, I'm just telling
you, all you have to know
00:20:29.199 --> 00:20:30.740
is what b of theta
is, and then, you
00:20:30.740 --> 00:20:33.190
can just take the derivative.
00:20:33.190 --> 00:20:35.750
Let's see what the second
equality is going to give us.
00:20:56.490 --> 00:21:00.830
OK, so what is the
second equality?
00:21:00.830 --> 00:21:03.850
It's telling me that if I
look at the second derivative,
00:21:03.850 --> 00:21:07.410
and then I take its
expectation, I'm
00:21:07.410 --> 00:21:11.190
going to have something which
is equal to negative this guy
00:21:11.190 --> 00:21:13.059
squared.
00:21:13.059 --> 00:21:14.350
Sorry, that was the log, right?
00:21:19.970 --> 00:21:22.700
We've already computed
this first derivative
00:21:22.700 --> 00:21:24.380
of the likelihood.
00:21:24.380 --> 00:21:29.390
It's just the expectation of
the square of this thing here.
00:21:29.390 --> 00:21:34.070
So expectation of
the derivative,
00:21:34.070 --> 00:21:38.900
with respect to theta of
log, f theta of x, divided
00:21:38.900 --> 00:21:44.130
by partial theta squared.
00:21:44.130 --> 00:21:50.040
This is equal to the
expectation of the square of y
00:21:50.040 --> 00:21:58.580
minus b of theta, divided
by phi squared--
00:21:58.580 --> 00:21:59.720
b prime of theta, squared.
00:22:04.375 --> 00:22:06.500
OK, sorry, I'm actually
going to move on with the--
00:22:11.100 --> 00:22:13.320
so if I start computing,
what is this thing?
00:22:13.320 --> 00:22:16.350
Well, we just agreed
that this was what?
00:22:19.580 --> 00:22:22.920
The expectation, E theta of y, right?
00:22:22.920 --> 00:22:27.120
So that's just the
expectation of y.
00:22:27.120 --> 00:22:28.230
We just computed it here.
00:22:31.548 --> 00:22:32.970
AUDIENCE: [INAUDIBLE]
00:22:35.744 --> 00:22:37.410
PHILIPPE RIGOLLET:
Yeah, that's b prime.
00:22:37.410 --> 00:22:39.050
There's a derivative here.
00:22:44.760 --> 00:22:47.900
So now, this is what?
00:22:47.900 --> 00:22:57.680
This is simply-- anyone?
00:23:01.580 --> 00:23:02.810
I'm sorry?
00:23:02.810 --> 00:23:05.660
Variance of y, but you're
scaling by phi squared.
00:23:11.040 --> 00:23:15.390
OK, so this is negative
of the right-hand side
00:23:15.390 --> 00:23:17.250
of our equality.
00:23:17.250 --> 00:23:21.810
And now, I just have to take
one more derivative to this guy.
00:23:21.810 --> 00:23:27.420
So now, if I look at
the left-hand side now,
00:23:27.420 --> 00:23:29.920
I have that the
second derivative
00:23:29.920 --> 00:23:38.890
of log, of f theta of y, divided
by partial of theta squared.
00:23:38.890 --> 00:23:40.680
So this thing is equal to--
00:23:40.680 --> 00:23:42.710
well, no, I'm not
left with much.
00:23:42.710 --> 00:23:44.630
The y part is
going to go away,
00:23:44.630 --> 00:23:47.590
and I'm left only with the
second derivative of b of theta--
00:23:47.590 --> 00:23:49.930
minus the second derivative of
b of theta, divided by phi.
00:24:00.540 --> 00:24:03.790
So if I take expectation--
00:24:03.790 --> 00:24:05.360
well, it just doesn't change.
00:24:08.010 --> 00:24:09.320
This is deterministic.
00:24:09.320 --> 00:24:11.240
So now, what I've
established is that this guy
00:24:11.240 --> 00:24:14.010
is equal to negative this guy.
00:24:14.010 --> 00:24:17.050
So those two things, the
signs are going to go away.
00:24:17.050 --> 00:24:20.910
And so this implies
that the variance of y
00:24:20.910 --> 00:24:25.800
is equal to b prime prime theta.
00:24:25.800 --> 00:24:30.240
And then, I have a phi
square in denominator
00:24:30.240 --> 00:24:33.180
that cancels only one of the
phi squares, so it's times phi.
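NOTE The variance computation can be laid out as follows. This is a sketch that plugs the canonical form into the second identity and uses the fact that the expectation of y is b prime of theta.

```latex
\[
\mathbb{E}_\theta\!\left[ \left( \frac{Y - b'(\theta)}{\phi} \right)^{2} \right]
= \frac{\operatorname{Var}_\theta(Y)}{\phi^{2}},
\qquad
\frac{\partial^2}{\partial\theta^2} \log f_\theta(Y) = - \frac{b''(\theta)}{\phi} .
\]
\[
- \frac{b''(\theta)}{\phi} = - \frac{\operatorname{Var}_\theta(Y)}{\phi^{2}}
\quad\Longrightarrow\quad
\operatorname{Var}_\theta(Y) = \phi\, b''(\theta) .
\]
```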
00:24:37.480 --> 00:24:41.140
So now, I have that my second
derivative-- since I know phi
00:24:41.140 --> 00:24:46.280
is completely
determining the variance.
00:24:46.280 --> 00:24:52.470
So basically, that's why b is
called the cumulant generating
00:24:52.470 --> 00:24:52.970
function.
00:24:52.970 --> 00:24:55.430
It's not generating
moments, but cumulants.
00:24:55.430 --> 00:24:59.180
But cumulants, in this
case, correspond, basically,
00:24:59.180 --> 00:25:01.060
to the moments, at
least for the first two.
00:25:01.060 --> 00:25:03.890
If I start going
farther, I'm going
00:25:03.890 --> 00:25:08.090
to have more combinations of
the expectation of y cubed, y squared,
00:25:08.090 --> 00:25:09.530
and y itself.
00:25:13.150 --> 00:25:14.770
But as we know,
those are the ones
00:25:14.770 --> 00:25:17.170
that are usually the
most useful, at least
00:25:17.170 --> 00:25:19.384
if we're interested in
asymptotic performance.
00:25:19.384 --> 00:25:20.800
The central limit
theorem tells us
00:25:20.800 --> 00:25:23.380
that all that matters are
the first two moments,
00:25:23.380 --> 00:25:25.580
and then, the rest is just
going to go and say well,
00:25:25.580 --> 00:25:26.330
it doesn't matter.
00:25:26.330 --> 00:25:29.300
It's all going to
[INAUDIBLE] anyway.
00:25:29.300 --> 00:25:31.290
So let's go to a
Poisson for example.
00:25:31.290 --> 00:25:33.518
So if I had a Poisson
distribution--
00:25:39.910 --> 00:25:42.430
so this is a discrete
distribution.
00:25:42.430 --> 00:25:46.390
And what I know is that f--
00:25:46.390 --> 00:25:51.580
let me call mu the
parameter of y.
00:25:56.290 --> 00:26:01.870
So it's mu to the y, divided
by y factorial, exponential
00:26:01.870 --> 00:26:02.650
minus mu.
00:26:02.650 --> 00:26:04.570
OK so mu is usually
called lambda,
00:26:04.570 --> 00:26:06.294
and y is usually
called x, that's
00:26:06.294 --> 00:26:07.960
why it takes me a
little bit of time.
00:26:07.960 --> 00:26:09.626
But usually it's
lambda to the x over
00:26:09.626 --> 00:26:13.810
factorial x, exponential
minus lambda.
00:26:13.810 --> 00:26:16.490
Since this is just the series
expansion of the exponential
00:26:16.490 --> 00:26:19.230
when I sum those things
from 0 to infinity,
00:26:19.230 --> 00:26:20.949
this thing sums to 1.
00:26:20.949 --> 00:26:22.990
But then, if I wanted to
start understanding what
00:26:22.990 --> 00:26:25.900
the expectation
of this thing is--
00:26:25.900 --> 00:26:28.340
so if I want to understand
the expectation with respect
00:26:28.340 --> 00:26:33.820
to mu of y, then, I would
have to compute the sum
00:26:33.820 --> 00:26:48.280
from k equals 0 to infinity
of k, times mu to the k,
00:26:48.280 --> 00:26:51.587
over factorial of k,
exponential minus mu--
00:26:51.587 --> 00:26:53.170
which means that I
would, essentially,
00:26:53.170 --> 00:27:06.090
have to take the derivative
of my series in the end.
00:27:06.090 --> 00:27:07.174
So I can do this.
00:27:07.174 --> 00:27:08.340
This is a standard exercise.
00:27:08.340 --> 00:27:10.630
You've probably done it
when you took probability.
00:27:10.630 --> 00:27:12.900
But let's see if we can
actually just read it off
00:27:12.900 --> 00:27:14.760
from the first derivative of b.
00:27:14.760 --> 00:27:16.380
So to do that, we
need to write this
00:27:16.380 --> 00:27:18.850
in the form of an
exponential, where there
00:27:18.850 --> 00:27:23.220
is one parameter that captures
mu, that interacts with y, just
00:27:23.220 --> 00:27:25.860
doing this parameter times
y, and then something that
00:27:25.860 --> 00:27:29.430
depends only on y, and then
something that depends only
00:27:29.430 --> 00:27:32.979
on mu.
00:27:32.979 --> 00:27:34.020
That's the important one.
00:27:34.020 --> 00:27:35.550
That's going to be
our B. And then,
00:27:35.550 --> 00:27:39.150
there's going to be something
that depends only on y.
00:27:39.150 --> 00:27:42.990
So let's write this and
check that this f mu, indeed,
00:27:42.990 --> 00:27:46.510
belongs to this canonical
exponential family.
00:27:46.510 --> 00:27:48.600
So I definitely
have an exponential
00:27:48.600 --> 00:27:50.075
that comes from this guy.
00:27:50.075 --> 00:27:51.454
So I have minus mu.
00:27:51.454 --> 00:27:53.370
And then, this thing is
going to give me what?
00:27:53.370 --> 00:27:58.260
It's going to give
me plus y log mu.
00:27:58.260 --> 00:28:02.166
And then, I'm going to have
minus log of y factorial.
00:28:06.480 --> 00:28:08.790
So clearly, I have a
term that depends only
00:28:08.790 --> 00:28:11.340
on mu, terms that
depend only on y,
00:28:11.340 --> 00:28:15.300
and I have a product of y and
something that depends on mu.
00:28:15.300 --> 00:28:17.820
If I want to be
canonical, I must
00:28:17.820 --> 00:28:23.650
have this to be exactly
the parameter theta itself.
00:28:23.650 --> 00:28:27.150
So I'm going to
call this guy theta.
00:28:27.150 --> 00:28:30.750
So theta is log mu,
which means that mu
00:28:30.750 --> 00:28:32.592
is equal to e to the theta.
00:28:32.592 --> 00:28:34.050
And so wherever I
see mu, I'm going
00:28:34.050 --> 00:28:36.716
to replace it by e to the
theta, because my new parameter
00:28:36.716 --> 00:28:38.070
now, is theta.
00:28:38.070 --> 00:28:38.880
So this is what?
00:28:38.880 --> 00:28:43.490
This is equal to
exponential y times theta.
00:28:43.490 --> 00:28:47.860
And then, I'm going to
have minus e to the theta.
00:28:47.860 --> 00:28:51.630
And then, who cares, something
that depends only on y.
00:28:51.630 --> 00:28:58.330
So this is my c of y, and phi
is equal to 1 in this case.
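Written out, the board computation looks roughly like this. It is only a sketch: the general template with b, c and phi is assumed to be the canonical form introduced earlier in the lecture, and the symbols follow the transcript.

```latex
\begin{align*}
f_\mu(y) &= \frac{\mu^{y}}{y!}\, e^{-\mu}
          = \exp\bigl(y \log \mu - \mu - \log(y!)\bigr) \\
         &= \exp\Bigl(\frac{y\theta - b(\theta)}{\phi} + c(y)\Bigr),
\qquad \theta = \log \mu,\quad b(\theta) = e^{\theta},\quad
c(y) = -\log(y!),\quad \phi = 1 .
\end{align*}
```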
00:28:58.330 --> 00:29:00.930
So that's all I care about.
00:29:00.930 --> 00:29:01.830
So let's use it.
00:29:05.000 --> 00:29:08.770
So this is my canonical
exponential family.
00:29:08.770 --> 00:29:11.680
Y interacts with theta
exactly like this.
00:29:11.680 --> 00:29:13.040
And then, I have this function.
00:29:13.040 --> 00:29:17.410
So this function here
must be b of theta.
00:29:20.080 --> 00:29:22.780
So from this function,
exponential theta,
00:29:22.780 --> 00:29:25.000
I'm supposed to be able
to read what the mean is.
00:29:39.820 --> 00:29:41.796
AUDIENCE: [INAUDIBLE]
00:29:43.772 --> 00:29:46.990
PHILIPPE RIGOLLET: Because
since in this course
00:29:46.990 --> 00:29:48.610
I always know what
the dispersion is,
00:29:48.610 --> 00:29:52.450
I can actually always absorb
it into theta from one.
00:29:52.450 --> 00:29:54.670
But here, it's really of
the form y times something
00:29:54.670 --> 00:29:55.720
divided by 1, right?
00:30:01.030 --> 00:30:04.805
If it was like log
of mu divided by phi,
00:30:04.805 --> 00:30:06.430
that would be the
question of whether I
00:30:06.430 --> 00:30:10.300
want to call phi my
dispersion, or if I
00:30:10.300 --> 00:30:12.070
want to just have it in there.
00:30:16.610 --> 00:30:18.740
This makes no
difference in practice.
00:30:18.740 --> 00:30:20.860
But the real thing
is it's never going
00:30:20.860 --> 00:30:22.610
to happen that this
thing, this dispersion,
00:30:22.610 --> 00:30:23.960
is going to be an exact number.
00:30:23.960 --> 00:30:26.240
If it's an actual
numerical number,
00:30:26.240 --> 00:30:28.580
this just means that this
number should be absorbed
00:30:28.580 --> 00:30:32.120
in the definition of theta.
00:30:32.120 --> 00:30:34.700
But if it's something
that is called sigma, say,
00:30:34.700 --> 00:30:36.470
and I will assume
that sigma is known,
00:30:36.470 --> 00:30:39.162
then it's probably preferable
to keep it in the dispersion,
00:30:39.162 --> 00:30:41.120
so you can see that
there's this parameter here
00:30:41.120 --> 00:30:44.450
that you can,
essentially, play with.
00:30:44.450 --> 00:30:48.810
It doesn't make any
difference when you know phi.
00:30:48.810 --> 00:30:55.050
So now, if I look at the
expectation of some y-- so now,
00:30:55.050 --> 00:31:00.419
I'm going to have y, which
follows my Poisson mu.
00:31:00.419 --> 00:31:01.960
I'm going to look
at the expectation,
00:31:01.960 --> 00:31:09.210
and I know that the expectation
is b prime of theta.
00:31:09.210 --> 00:31:09.950
Agreed?
00:31:09.950 --> 00:31:14.780
That's what I just
erased, I think.
00:31:14.780 --> 00:31:17.020
Agreed with this,
the derivative?
00:31:17.020 --> 00:31:18.550
So what is this?
00:31:18.550 --> 00:31:23.050
Well, it's the derivative
of e to the theta, which
00:31:23.050 --> 00:31:27.270
is e to the theta, which is mu.
00:31:27.270 --> 00:31:30.030
So my Poisson is
parametrized by its mean.
00:31:30.030 --> 00:31:34.850
I can also compute
the variance, which
00:31:34.850 --> 00:31:40.580
is equal to minus the
second derivative of--
00:31:40.580 --> 00:31:42.460
no, it's equal to the
second derivative of b.
00:31:47.170 --> 00:31:49.090
Dispersion is equal to 1.
00:31:49.090 --> 00:31:55.000
Again, if I took phi elsewhere,
I would see it here as well.
00:31:55.000 --> 00:31:57.760
So if I just absorbed phi here,
I would see it divided here,
00:31:57.760 --> 00:32:00.040
so it would not
make any difference.
00:32:00.040 --> 00:32:02.925
And what is the second
derivative of the exponential?
00:32:06.570 --> 00:32:09.820
Still the exponential--
so it's still equal to mu.
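As a quick sanity check (not part of the lecture), one can compare the brute-force series for the Poisson mean and variance with the derivatives of b(theta) = e^theta. The value mu = 3.5, the truncation point 200, and the finite-difference step are arbitrary choices made for this sketch.

```python
import math

def poisson_pmf(k, mu):
    # Poisson PMF written in exponential form: exp(-mu + k log mu - log k!)
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def b(theta):
    # Log-partition function of the Poisson family in canonical form
    return math.exp(theta)

def first_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

mu = 3.5                 # arbitrary mean chosen for the check
theta = math.log(mu)     # canonical parameter
K = 200                  # truncation point for the infinite sums

mean_from_series = sum(k * poisson_pmf(k, mu) for k in range(K))
var_from_series = sum((k - mean_from_series) ** 2 * poisson_pmf(k, mu)
                      for k in range(K))

mean_from_b = first_derivative(b, theta)   # b'(theta)
var_from_b = second_derivative(b, theta)   # phi * b''(theta) with phi = 1

print(mean_from_series, mean_from_b)
print(var_from_series, var_from_b)
```

All four printed numbers should agree with mu up to truncation and finite-difference error.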
00:32:14.760 --> 00:32:17.950
So that certainly
makes our life easier.
00:32:17.950 --> 00:32:19.620
Just one quick remark--
00:32:31.130 --> 00:32:32.360
here's the function.
00:32:32.360 --> 00:32:35.150
I am giving you a problem--
00:32:35.150 --> 00:32:36.710
can the b function--
00:32:39.320 --> 00:32:46.550
can it ever be equal
to log of theta?
00:32:55.840 --> 00:32:58.310
Who says yes?
00:32:58.310 --> 00:33:00.428
Who says no?
00:33:00.428 --> 00:33:02.858
Why?
00:33:02.858 --> 00:33:04.802
AUDIENCE: [INAUDIBLE]
00:33:09.680 --> 00:33:13.670
PHILIPPE RIGOLLET: Yeah, so
what I've learned from this--
00:33:13.670 --> 00:33:16.610
it's sort of completely
analytic, right?
00:33:16.610 --> 00:33:19.640
So we just took derivatives,
and this thing just happened.
00:33:19.640 --> 00:33:22.490
This thing actually allowed us
to relate the second derivative
00:33:22.490 --> 00:33:24.299
of b to the variance.
00:33:24.299 --> 00:33:26.090
And one thing that we
know about a variance
00:33:26.090 --> 00:33:27.920
is that this is non-negative.
00:33:27.920 --> 00:33:30.830
And in particular,
it's always positive.
00:33:30.830 --> 00:33:35.330
If they give you a canonical
exponential family that
00:33:35.330 --> 00:33:39.260
has zero variance, trust
me, you will see it.
00:33:39.260 --> 00:33:40.919
That means that
this thing is not
00:33:40.919 --> 00:33:42.710
going to look like
something that's finite,
00:33:42.710 --> 00:33:44.280
and it's going to
have a point mass.
00:33:44.280 --> 00:33:46.280
It's going to take value
infinity at one point.
00:33:46.280 --> 00:33:48.080
So this will,
basically, never happen.
00:33:48.080 --> 00:33:50.220
This thing is, actually,
strictly positive,
00:33:50.220 --> 00:33:52.600
which means that this thing
is always strictly convex.
00:33:52.600 --> 00:33:55.220
It means that the second
derivative of this function, b,
00:33:55.220 --> 00:34:00.440
has to be strictly positive, and
so that the function is convex.
00:34:00.440 --> 00:34:03.005
So this is concave, so this
is definitely not working.
00:34:03.005 --> 00:34:04.880
I need to have something
that looks like this
00:34:04.880 --> 00:34:07.920
when I talk about my b.
00:34:07.920 --> 00:34:10.500
So f theta squared--
00:34:10.500 --> 00:34:13.190
we'll see a bunch of
exponential theta.
00:34:13.190 --> 00:34:14.389
And there's a bunch of them.
00:34:14.389 --> 00:34:18.280
But if you started writing
something, and you find b--
00:34:18.280 --> 00:34:20.230
try to think of the
plot of b in your mind,
00:34:20.230 --> 00:34:23.980
and you find that b looks like
it's going to become concave,
00:34:23.980 --> 00:34:25.844
you've made a sign
mistake somewhere.
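The sign check described here can be sketched numerically. The candidate functions, the grid, and the helper names below are illustrative choices rather than anything from the lecture; the idea is only that a valid b must have a positive second derivative, which log(theta) does not.

```python
import math

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def looks_strictly_convex(f, grid):
    """True if the estimated second derivative is positive at every grid point."""
    return all(second_derivative(f, x) > 0.0 for x in grid)

# Grid of theta values 0.5, 0.75, ..., 2.75 (kept positive so log is defined)
grid = [0.5 + 0.25 * i for i in range(10)]

candidates = {
    "exp(theta)": math.exp,                 # the Poisson b from above
    "theta^2 / 2": lambda t: 0.5 * t * t,   # a quadratic, also convex
    "log(theta)": math.log,                 # concave, so not a valid b
}

results = {name: looks_strictly_convex(f, grid) for name, f in candidates.items()}
for name, ok in results.items():
    print(name, "convex on grid" if ok else "NOT convex on grid")
```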
00:34:30.110 --> 00:34:33.320
All right, so we've done
a pretty big parenthesis
00:34:33.320 --> 00:34:37.040
to try to characterize
what the distribution of y
00:34:37.040 --> 00:34:37.820
was going to be.
00:34:37.820 --> 00:34:41.679
We wanted to extend from, say,
Gaussian to something else.
00:34:41.679 --> 00:34:43.909
But when we're doing
regression, which
00:34:43.909 --> 00:34:46.520
means generalized
linear models, we
00:34:46.520 --> 00:34:48.500
are not interested in
the distribution of y
00:34:48.500 --> 00:34:51.650
but really the conditional
distribution of y given x.
00:34:51.650 --> 00:34:55.760
So I need now to couple
those back together.
00:34:55.760 --> 00:34:59.702
So what I know is that
this same mu, in this case,
00:34:59.702 --> 00:35:01.910
which is the expectation--
what I want to say is that
00:35:01.910 --> 00:35:09.740
the conditional
expectation of y given x--
00:35:12.710 --> 00:35:15.870
this is some mu of x.
00:35:15.870 --> 00:35:17.700
When we did linear
models, we said well,
00:35:17.700 --> 00:35:21.870
this thing was some x transpose
beta for linear models.
00:35:26.139 --> 00:35:27.680
And the whole premise
of this chapter
00:35:27.680 --> 00:35:29.900
is to say well, this
might make no sense,
00:35:29.900 --> 00:35:32.930
because x transpose beta
can take the entire range
00:35:32.930 --> 00:35:34.320
of real values.
00:35:34.320 --> 00:35:36.870
Whereas, this mu can take
only a partial range.
00:35:36.870 --> 00:35:40.550
So even if you actually focus
on the Poisson, for example,
00:35:40.550 --> 00:35:43.970
we know that the expectation
of a Poisson has to be
00:35:43.970 --> 00:35:45.910
a non-negative number--
00:35:45.910 --> 00:35:47.660
actually, a positive
number as soon as you
00:35:47.660 --> 00:35:49.970
have a little bit of variance.
00:35:49.970 --> 00:35:52.590
It's mu itself-- mu
is a positive number.
00:35:52.590 --> 00:35:54.800
And so it's not going
to make any sense
00:35:54.800 --> 00:35:57.710
to assume that mu of x is
equal to x transpose beta,
00:35:57.710 --> 00:36:00.710
because you might find some x's
for which this value ends up
00:36:00.710 --> 00:36:02.302
being negative.
00:36:02.302 --> 00:36:03.760
And so we're going
to need, what we
00:36:03.760 --> 00:36:05.860
call, the link
function that relates,
00:36:05.860 --> 00:36:08.560
that transforms mu,
maps onto the real line,
00:36:08.560 --> 00:36:13.210
so that you can now express it
of the form x transpose beta.
00:36:13.210 --> 00:36:17.560
So we're going to take
not this, but we're
00:36:17.560 --> 00:36:21.250
going to assume
that g of mu of x
00:36:21.250 --> 00:36:24.430
is now equal to
x transpose beta,
00:36:24.430 --> 00:36:26.140
and that's the
generalized linear models.
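In symbols, the modeling assumption can be summarized as follows. The notation is the transcript's, and writing the conditional mean as mu(x) is the only convention assumed.

```latex
\[
\mu(x) = \mathbb{E}[Y \mid X = x], \qquad
g\bigl(\mu(x)\bigr) = x^{\top}\beta
\quad\Longleftrightarrow\quad
\mu(x) = g^{-1}\bigl(x^{\top}\beta\bigr).
\]
```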
00:36:33.220 --> 00:36:40.650
So as I said, it's weird to
transform x transpose beta--
00:36:40.650 --> 00:36:43.420
a mu to make it
take the real line.
00:36:43.420 --> 00:36:44.920
At least to me, it
feels a bit more
00:36:44.920 --> 00:36:47.530
natural to take x
transpose beta and make
00:36:47.530 --> 00:36:51.000
it fit to the particular
distribution that I want.
00:36:51.000 --> 00:36:53.890
And so I'm going to want to
talk about g and g inverse
00:36:53.890 --> 00:36:55.550
at the same time.
00:36:55.550 --> 00:36:59.070
So I'm going to
actually take always g.
00:36:59.070 --> 00:37:04.920
So g is my link
function, and I'm
00:37:04.920 --> 00:37:10.036
going to want g to be
continuously differentiable.
00:37:16.980 --> 00:37:19.020
OK, let's say that
it has a derivative,
00:37:19.020 --> 00:37:22.630
and its derivative
is continuous.
00:37:22.630 --> 00:37:28.398
And I'm going to want g
to be strictly increasing.
00:37:34.770 --> 00:37:39.930
And that actually implies
that g inverse exists.
00:37:39.930 --> 00:37:43.410
Actually, that's not true.
00:37:43.410 --> 00:37:54.505
What I'm also going to want
is that g of mu spans--
00:37:57.420 --> 00:37:58.380
how do I do this?
00:38:06.090 --> 00:38:09.720
So I want g of mu, as I range
over all possible values of mu,
00:38:09.720 --> 00:38:11.220
whether they're all
positive values,
00:38:11.220 --> 00:38:12.969
or whether they're
values that are limited
00:38:12.969 --> 00:38:15.240
to the interval 0,
1, I want those to span
00:38:15.240 --> 00:38:18.340
the entire real line, so that
when I want to talk about g
00:38:18.340 --> 00:38:20.282
inverse, defined over
the entire real line,
00:38:20.282 --> 00:38:21.240
I know where I started.
00:38:24.396 --> 00:38:26.660
So this implies that
g inverse exists.
00:38:30.200 --> 00:38:32.388
What else does it
imply about g inverse?
00:38:39.500 --> 00:38:41.270
So for a function
to be invertible,
00:38:41.270 --> 00:38:43.855
I only need for it to
be strictly monotone.
00:38:43.855 --> 00:38:45.605
I don't need it to be
strictly increasing.
00:38:45.605 --> 00:38:47.729
So in particular, the fact
that I picked increasing
00:38:47.729 --> 00:38:53.360
implies that this guy
is actually increasing.
00:38:53.360 --> 00:38:54.820
AUDIENCE: [INAUDIBLE]
00:38:54.820 --> 00:38:56.320
PHILIPPE RIGOLLET:
That's the image.
00:39:03.470 --> 00:39:06.830
So this is my link function, and
this slide is just telling me
00:39:06.830 --> 00:39:08.330
I want my function
to be invertible,
00:39:08.330 --> 00:39:09.890
so I can talk about g inverse.
00:39:09.890 --> 00:39:12.510
I'm going to switch
between the two.
00:39:12.510 --> 00:39:15.450
So what link functions
am I going to get?
00:39:15.450 --> 00:39:17.214
So for linear
models, we just said
00:39:17.214 --> 00:39:18.630
there's no link
function, which is
00:39:18.630 --> 00:39:20.962
the same as saying that the
link function is identity,
00:39:20.962 --> 00:39:22.920
which certainly satisfies
all these conditions.
00:39:22.920 --> 00:39:23.735
It's invertible.
00:39:23.735 --> 00:39:25.110
It has all these
nice properties,
00:39:25.110 --> 00:39:27.540
but might as well
not talk about it.
00:39:27.540 --> 00:39:29.220
For Poisson data,
when we assume that
00:39:29.220 --> 00:39:32.250
the conditional distribution
of y given x is Poisson,
00:39:32.250 --> 00:39:37.200
the mu, as I just said, is
required to be positive.
00:39:37.200 --> 00:39:43.650
So I need a g that goes
from the interval 0 infinity
00:39:43.650 --> 00:39:45.540
to the entire real line.
00:39:45.540 --> 00:39:47.910
I need a function that
starts from one end
00:39:47.910 --> 00:39:51.720
and just takes-- now,
all the positive values
00:39:51.720 --> 00:39:54.580
are split between positive
and negative values.
00:39:54.580 --> 00:39:56.820
And here, for example, I
could take the log link.
00:39:56.820 --> 00:40:01.260
So the log is defined
on this entire interval.
00:40:01.260 --> 00:40:04.050
And as I range from
0 to plus infinity,
00:40:04.050 --> 00:40:07.440
the log is ranging from negative
infinity to plus infinity.
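To make the point concrete, here is a small sketch with made-up numbers. The coefficients `beta` and the covariate rows `xs` are hypothetical; the only claim is that x transpose beta can be negative while its image under the inverse of the log link is always a valid, positive Poisson mean.

```python
import math

# Hypothetical coefficients and covariates (first entry of each x is an intercept)
beta = [0.5, -2.0]
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0]]

def linear_predictor(x, beta):
    # x transpose beta
    return sum(xi * bi for xi, bi in zip(x, beta))

etas = [linear_predictor(x, beta) for x in xs]   # can be any real number
mus = [math.exp(eta) for eta in etas]            # g inverse = exp, always positive
recovered = [math.log(mu) for mu in mus]         # g = log maps the mean back

for eta, mu in zip(etas, mus):
    print(f"x^T beta = {eta:+.2f}  ->  mu = {mu:.4f}")
```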
00:40:10.382 --> 00:40:12.090
You can probably think
of other functions
00:40:12.090 --> 00:40:15.510
that do that, like 2 times log.
00:40:15.510 --> 00:40:16.860
That's another one.
00:40:16.860 --> 00:40:20.170
But there's many others
you can think of.
00:40:20.170 --> 00:40:21.752
But let's say the
log is one of them
00:40:21.752 --> 00:40:23.210
that you might want
to think about.
00:40:32.680 --> 00:40:34.410
It is natural in
the sense that it's
00:40:34.410 --> 00:40:36.160
one of the first
functions we can think of.
00:40:36.160 --> 00:40:39.840
We will see, also, that it has
another canonical property that
00:40:39.840 --> 00:40:42.090
makes it a natural choice.
00:40:42.090 --> 00:40:44.130
The other one is
the other example,
00:40:44.130 --> 00:40:47.520
where we had an even stronger
condition on what mu could be.
00:40:47.520 --> 00:40:49.620
Mu could only be a
number between 0 and 1,
00:40:49.620 --> 00:40:52.780
that was the probability
of success of a coin flip--
00:40:52.780 --> 00:40:55.290
probability of success of a
Bernoulli random variable.
00:40:55.290 --> 00:40:59.310
And now, I need g to map 0,
1 to the entire real line.
00:40:59.310 --> 00:41:02.490
And so here are
a bunch of things
00:41:02.490 --> 00:41:04.980
that you can come up
with, because now you
00:41:04.980 --> 00:41:08.220
start to have maybe--
00:41:08.220 --> 00:41:11.340
I will soon claim that
this one, log of mu,
00:41:11.340 --> 00:41:14.220
divided by 1 minus mu
is the most natural one.
00:41:14.220 --> 00:41:16.770
But maybe, if you had
never thought of this,
00:41:16.770 --> 00:41:18.780
that might not be
the first function
00:41:18.780 --> 00:41:20.490
you would come up with, right?
00:41:20.490 --> 00:41:23.670
You mentioned trigonometric
functions, for example,
00:41:23.670 --> 00:41:25.980
so maybe, you can
come up with something
00:41:25.980 --> 00:41:30.960
that comes from hyperbolic
trigonometry or something.
00:41:30.960 --> 00:41:32.329
So what does this function do?
00:41:32.329 --> 00:41:34.370
Well, we'll see a picture,
but this function does
00:41:34.370 --> 00:41:36.990
map the interval 0, 1
to the entire real line.
00:41:36.990 --> 00:41:40.770
We also discuss the fact that
if we think reciprocally--
00:41:43.740 --> 00:41:46.110
what I want if I want to
think about g inverse,
00:41:46.110 --> 00:41:49.140
I want a function that maps the
entire real line into the unit
00:41:49.140 --> 00:41:49.920
interval.
00:41:49.920 --> 00:41:52.650
And as we said, if I'm not
a very creative statistician
00:41:52.650 --> 00:41:55.960
or probabilist, I can just
pick my favorite continuous,
00:41:55.960 --> 00:41:59.710
strictly increasing cumulative
distribution function,
00:41:59.710 --> 00:42:01.350
which as we know,
will arise as soon
00:42:01.350 --> 00:42:03.060
as I have a density
that has support
00:42:03.060 --> 00:42:04.830
on the entire real line.
00:42:04.830 --> 00:42:07.350
If I have support everywhere,
then it means that my--
00:42:12.100 --> 00:42:14.140
it is strictly positive
everywhere, then,
00:42:14.140 --> 00:42:17.070
it means that my cumulative
distribution function
00:42:17.070 --> 00:42:18.690
has to be strictly increasing.
00:42:18.690 --> 00:42:21.450
And of course, it has to go
from 0 to 1, because that's just
00:42:21.450 --> 00:42:22.717
the nature of those things.
00:42:22.717 --> 00:42:24.550
And so for example, I
can take the Gaussian,
00:42:24.550 --> 00:42:25.980
that's one such function.
00:42:25.980 --> 00:42:28.591
But I could also take
the double exponential
00:42:28.591 --> 00:42:30.340
that looks like an
exponential on one end,
00:42:30.340 --> 00:42:32.460
and then an exponential
on the other end.
00:42:32.460 --> 00:42:39.930
And basically, if you
take capital phi, which
00:42:39.930 --> 00:42:43.560
is the standard Gaussian
cumulative distribution
00:42:43.560 --> 00:42:47.460
function, it does work for you,
and you can take its inverse.
00:42:47.460 --> 00:42:49.160
And in this case,
we don't talk about,
00:42:49.160 --> 00:42:51.810
so this guy is called
logit or logit.
00:42:51.810 --> 00:42:53.172
And this guy is called probit.
00:42:53.172 --> 00:42:54.630
And you see it,
usually, every time
00:42:54.630 --> 00:42:58.534
you have a package on
generalized linear models.
00:42:58.534 --> 00:42:59.700
You are trying to implement.
00:42:59.700 --> 00:43:00.970
You have this choice.
00:43:00.970 --> 00:43:04.009
And for what's called logistic
regression-- so it's funny
00:43:04.009 --> 00:43:05.550
that it's called
logistic regression,
00:43:05.550 --> 00:43:07.410
but you can actually
use the probit link,
00:43:07.410 --> 00:43:10.620
which in this case, is
called probit regression.
00:43:10.620 --> 00:43:12.480
But those things are
essentially equivalent,
00:43:12.480 --> 00:43:14.816
and it's really a
matter of taste.
00:43:14.816 --> 00:43:16.440
Maybe of communities--
some communities
00:43:16.440 --> 00:43:18.140
might prefer one over the other.
00:43:18.140 --> 00:43:20.490
We'll see that
again, as I claimed
00:43:20.490 --> 00:43:24.810
before, the logistic,
the logit one
00:43:24.810 --> 00:43:29.130
has a slightly more compelling
argument for its reason
00:43:29.130 --> 00:43:30.152
to exist.
00:43:30.152 --> 00:43:31.860
I guess this one, the
compelling argument
00:43:31.860 --> 00:43:34.770
is that it involved the
standard Gaussian, which
00:43:34.770 --> 00:43:37.470
of course, is something that
should show up everywhere.
00:43:37.470 --> 00:43:41.340
And then, you can think
about crazy stuff.
00:43:41.340 --> 00:43:43.770
Even crazy gets a name--
00:43:43.770 --> 00:43:48.670
complementary log-log, which is
the log of minus log of 1 minus mu.
00:43:48.670 --> 00:43:49.170
Why not?
00:43:52.890 --> 00:43:56.450
So I guess you can
iterate that thing.
00:43:56.450 --> 00:43:59.510
You can just put a log 1
minus in front of this thing,
00:43:59.510 --> 00:44:01.160
and it's still going to go.
00:44:01.160 --> 00:44:07.810
So that's not true.
00:44:07.810 --> 00:44:10.377
I have to put a minus and take--
00:44:10.377 --> 00:44:11.210
no, that's not true.
00:44:13.707 --> 00:44:15.290
So you can think of
whatever you want.
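For reference, the three links named so far can be written down in a few lines. This is a sketch using only the Python standard library; the function names are chosen here for illustration and are not taken from any particular GLM package.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard Gaussian, mean 0 and standard deviation 1

def logit(mu):
    return math.log(mu / (1.0 - mu))

def logit_inverse(eta):
    # Equal to e^eta / (1 + e^eta)
    return 1.0 / (1.0 + math.exp(-eta))

def probit(mu):
    # Inverse of the standard Gaussian CDF
    return _std_normal.inv_cdf(mu)

def probit_inverse(eta):
    return _std_normal.cdf(eta)

def cloglog(mu):
    # Complementary log-log: log(-log(1 - mu))
    return math.log(-math.log(1.0 - mu))

def cloglog_inverse(eta):
    return 1.0 - math.exp(-math.exp(eta))

mus = [0.1, 0.25, 0.5, 0.75, 0.9]
for mu in mus:
    print(f"mu={mu:.2f}  logit={logit(mu):+.3f}  "
          f"probit={probit(mu):+.3f}  cloglog={cloglog(mu):+.3f}")
```

Each link maps the open interval (0, 1) to the real line, and each inverse maps back, which is all the earlier conditions ask for.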
00:44:19.320 --> 00:44:21.770
So I claimed that the logit
link is the natural choice, so
00:44:21.770 --> 00:44:22.970
here's a picture.
00:44:22.970 --> 00:44:25.590
I should have actually
plotted the other one,
00:44:25.590 --> 00:44:27.399
so we can actually compare it.
00:44:27.399 --> 00:44:29.690
To be fair, I don't even
remember how it would actually
00:44:29.690 --> 00:44:32.010
fit into those two functions.
00:44:32.010 --> 00:44:35.712
So the blue one, which is
this one, for those of you
00:44:35.712 --> 00:44:37.670
who don't see the difference
between blue and red--
00:44:37.670 --> 00:44:39.300
sorry about that.
00:44:39.300 --> 00:44:45.320
So this, the blue one,
is the logistic one.
00:44:45.320 --> 00:44:49.980
So this guy is the function that
does e to the x, over 1 plus
00:44:49.980 --> 00:44:50.480
e to the x.
00:44:50.480 --> 00:44:51.560
As you can see,
this is a function
00:44:51.560 --> 00:44:53.600
that's supposed to map
the entire real line
00:44:53.600 --> 00:44:55.970
into the interval 0, 1.
00:44:55.970 --> 00:44:58.220
So that's supposed to be the
inverse of your function,
00:44:58.220 --> 00:45:00.580
and I claimed that this is
the inverse of the logistic
00:45:00.580 --> 00:45:02.090
of the logit function.
00:45:02.090 --> 00:45:04.340
And the other one, well,
this is the Gaussian CDF,
00:45:04.340 --> 00:45:06.470
so you know it's clearly
the inverse of the inverse
00:45:06.470 --> 00:45:07.732
of the Gaussian CDF.
00:45:07.732 --> 00:45:08.690
And that's the red one.
00:45:08.690 --> 00:45:09.939
That's the one that goes here.
00:45:12.290 --> 00:45:15.320
I would guess that the
complementary log-log is
00:45:15.320 --> 00:45:17.600
something that's probably
going above here,
00:45:17.600 --> 00:45:19.790
and for which the
slope is, actually,
00:45:19.790 --> 00:45:22.840
even a little flatter
as you cross 0.
00:45:26.250 --> 00:45:29.119
So of course, these are
not our link functions.
00:45:29.119 --> 00:45:30.910
These are the inverse
of our link function.
00:45:30.910 --> 00:45:32.730
So what do they look
like when I actually,
00:45:32.730 --> 00:45:36.670
basically, flip my
thing like this?
00:45:36.670 --> 00:45:38.810
So this is what I see.
00:45:38.810 --> 00:45:42.600
And so I can see that in blue,
this is my logistic link.
00:45:42.600 --> 00:45:46.140
So it crosses 0 with a
slightly faster rate.
00:45:46.140 --> 00:45:49.830
Remember, if we could
use the identity, that
00:45:49.830 --> 00:45:51.134
would be very nice to us.
00:45:51.134 --> 00:45:52.800
We would just want
to take the identity.
00:45:52.800 --> 00:45:55.145
The problem is that
if I start having
00:45:55.145 --> 00:45:56.520
the identity that
goes here, it's
00:45:56.520 --> 00:45:58.810
going to start being a problem.
00:45:58.810 --> 00:46:06.419
And this is the probit link,
the phi inverse that you see here.
00:46:06.419 --> 00:46:07.335
It's a little flatter.
00:46:10.290 --> 00:46:16.599
You can compute the derivative
at zero of those guys.
00:46:16.599 --> 00:46:17.890
What is the derivative of the--
00:46:21.180 --> 00:46:24.380
So I'm taking the derivative
of log of x over 1 minus x.
00:46:24.380 --> 00:46:32.010
So it's 1 over x,
minus 1 over 1 minus x.
00:46:35.850 --> 00:46:39.120
So if I look at 0.5--
00:46:39.120 --> 00:46:40.770
sorry, this is
the interval 0, 1.
00:46:40.770 --> 00:46:48.070
So I'm interested
in the slope at 0.5.
00:46:48.070 --> 00:46:49.660
Yes, it's plus, thank you.
00:46:49.660 --> 00:46:53.230
So at 0.5, what I
get is 2 plus 2.
00:46:57.090 --> 00:47:02.650
Yeah, so that's the
slope that we get,
00:47:02.650 --> 00:47:07.434
and if you compute
for the derivative--
00:47:07.434 --> 00:47:09.100
what is the derivative
of a phi inverse?
00:47:09.100 --> 00:47:13.180
Well, it's 1,
divided
00:47:13.180 --> 00:47:20.640
by little phi of capital
phi, inverse of x.
00:47:20.640 --> 00:47:24.319
So little phi at 1/2--
00:47:24.319 --> 00:47:24.860
I don't know.
00:47:29.450 --> 00:47:30.950
Yeah, I guess I can
probably compute
00:47:30.950 --> 00:47:32.590
the derivative of
the capital phi
00:47:32.590 --> 00:47:34.460
at 0, which is going
to be just that.
00:47:34.460 --> 00:47:37.070
1 over square root of 2
pi, and then just say well,
00:47:37.070 --> 00:47:38.870
the slope has to be 1 over that.
00:47:42.972 --> 00:47:43.680
Square root 2 pi.
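The two slopes at mu = 0.5 can be checked numerically. This sketch assumes a plain central difference is accurate enough here; the step size and helper names are choices made for illustration.

```python
import math
from statistics import NormalDist

def central_difference(f, x, h=1e-5):
    # Numerical estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

def logit(mu):
    return math.log(mu / (1.0 - mu))

def probit(mu):
    return NormalDist().inv_cdf(mu)

logit_slope = central_difference(logit, 0.5)    # exact value: 1/0.5 + 1/0.5 = 4
probit_slope = central_difference(probit, 0.5)  # exact value: sqrt(2 * pi)

print("logit slope at 0.5: ", logit_slope)
print("probit slope at 0.5:", probit_slope, "vs sqrt(2 pi) =", math.sqrt(2.0 * math.pi))
```

The logit slope of 4 is larger than the probit slope of sqrt(2 pi), which matches the probit curve looking a little flatter in the middle of the plot.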
00:47:47.130 --> 00:47:50.310
So that's just a comparison,
but again, so far, we
00:47:50.310 --> 00:47:54.151
do not have any reason to
prefer one to the other.
00:47:54.151 --> 00:47:56.400
And so now, I'm going to
start giving you some reasons
00:47:56.400 --> 00:47:58.110
to prefer one to the other.
00:47:58.110 --> 00:48:01.260
And one of those two--
00:48:01.260 --> 00:48:03.570
and actually for each
canonical family,
00:48:03.570 --> 00:48:05.820
there is something which is
called the canonical link.
00:48:05.820 --> 00:48:07.486
And when you don't
have any other reason
00:48:07.486 --> 00:48:10.386
to choose anything else, why
not choose the canonical one?
00:48:10.386 --> 00:48:11.760
And the canonical
link is the one
00:48:11.760 --> 00:48:19.580
that says OK, what I want is g
to map mu onto the real line.
00:48:22.980 --> 00:48:28.550
But mu is not the parameter
of my canonical family.
00:48:28.550 --> 00:48:31.470
Here for example,
mu is e to the theta,
00:48:31.470 --> 00:48:33.290
but the canonical
parameter is theta.
00:48:36.050 --> 00:48:39.480
But the parameter of a
canonical exponential family
00:48:39.480 --> 00:48:42.650
is something that lives
in the entire real line.
00:48:42.650 --> 00:48:45.510
It was defined for all thetas.
00:48:45.510 --> 00:48:50.250
And so in particular,
I can just take theta
00:48:50.250 --> 00:48:52.950
to be the one that's
x transpose beta.
00:48:52.950 --> 00:48:54.480
And so in particular,
I'm just going
00:48:54.480 --> 00:48:57.180
to try to find the link
that just says OK, when
00:48:57.180 --> 00:49:00.470
I take g of mu,
I'm going to map,
00:49:00.470 --> 00:49:01.700
so that's what's going to be.
00:49:01.700 --> 00:49:05.499
So I know that g of mu is
going to be equal to x beta.
00:49:05.499 --> 00:49:07.040
And now, what I'm
going to say is OK,
00:49:07.040 --> 00:49:09.850
let's just take the g that
makes this guy equal to theta,
00:49:09.850 --> 00:49:11.600
so that it is theta
that I actually model,
00:49:11.600 --> 00:49:14.880
like x transpose beta.
00:49:14.880 --> 00:49:17.960
Feels pretty canonical, right?
00:49:17.960 --> 00:49:19.880
What else?
00:49:19.880 --> 00:49:22.280
What other central, easy
choice would you take?
00:49:22.280 --> 00:49:23.650
This was pretty easy.
00:49:23.650 --> 00:49:27.560
There is a natural parameter
for this canonical family,
00:49:27.560 --> 00:49:29.780
and it takes value on
the entire real line.
00:49:29.780 --> 00:49:32.500
I have a function that maps
mu onto the entire real line,
00:49:32.500 --> 00:49:36.260
so let's just map it to
the actual parameter.
00:49:36.260 --> 00:49:40.419
So now, OK, why do I have this?
00:49:40.419 --> 00:49:41.960
Well, we've already
figured that out.
00:49:41.960 --> 00:49:44.840
The canonical link function
is strictly increasing.
00:49:44.840 --> 00:49:49.670
Sorry, so I said that
now I want this guy--
00:49:49.670 --> 00:49:57.470
so I want g of mu to
be equal to theta,
00:49:57.470 --> 00:50:00.560
which is equivalent to
saying that I want mu to be
00:50:00.560 --> 00:50:03.860
equal to g inverse of theta.
00:50:03.860 --> 00:50:08.140
But we know that mu is what--
00:50:08.140 --> 00:50:09.160
b prime of theta.
00:50:15.640 --> 00:50:21.090
So that means that b prime is
the same function as g inverse.
00:50:21.090 --> 00:50:24.570
And I claimed that this is
actually giving me, indeed,
00:50:24.570 --> 00:50:27.930
a function that has the
properties that I want,
00:50:27.930 --> 00:50:30.060
because before I said,
just pick any function that
00:50:30.060 --> 00:50:31.080
has these properties.
00:50:31.080 --> 00:50:33.102
And now, I'm giving
you a very hard rule
00:50:33.102 --> 00:50:34.560
to pick this, though
you need still
00:50:34.560 --> 00:50:37.050
to check that it satisfies
those conditions and, in particular,
00:50:37.050 --> 00:50:39.050
that it's increasing
and invertible.
00:50:39.050 --> 00:50:41.050
And so for this to be
increasing and invertible,
00:50:41.050 --> 00:50:42.630
strictly increasing
and invertible,
00:50:42.630 --> 00:50:44.880
really what I need is that
the inverse is strictly
00:50:44.880 --> 00:50:48.070
increasing and invertible,
which is the case here,
00:50:48.070 --> 00:50:51.220
because b prime as we said--
00:50:51.220 --> 00:50:56.610
well, b prime is the derivative
of a strictly convex function.
00:50:56.610 --> 00:50:58.749
A strictly convex function
has a second derivative
00:50:58.749 --> 00:50:59.790
that's strictly positive.
00:50:59.790 --> 00:51:01.770
We just figured that
out using the fact
00:51:01.770 --> 00:51:03.790
that the variance was
strictly positive.
00:51:03.790 --> 00:51:06.330
And if phi is strictly
positive, then this thing
00:51:06.330 --> 00:51:08.530
has to be strictly positive.
00:51:08.530 --> 00:51:10.650
So if b prime, prime
is strictly positive--
00:51:10.650 --> 00:51:13.604
this is the derivative of
a function called b prime.
00:51:13.604 --> 00:51:15.270
If your derivative
is strictly positive,
00:51:15.270 --> 00:51:17.670
you are strictly increasing.
00:51:17.670 --> 00:51:22.810
And so we know that b prime is,
indeed, strictly increasing.
00:51:22.810 --> 00:51:26.060
And what I need also
to check-- well,
00:51:26.060 --> 00:51:28.010
I guess this is already
checked on its own,
00:51:28.010 --> 00:51:33.560
because b prime is
actually mapping all of R
00:51:33.560 --> 00:51:37.100
into the possible values.
00:51:37.100 --> 00:51:38.870
When theta ranges on
the entire real line,
00:51:38.870 --> 00:51:41.120
then b prime ranges
in the entire interval
00:51:41.120 --> 00:51:45.440
of the mean values
that it can take.
00:51:45.440 --> 00:51:48.030
And so now, I have this thing
that's completely defined.
00:51:48.030 --> 00:51:50.490
B prime inverse is a valid link.
00:51:56.030 --> 00:51:57.450
And it's called
a canonical link.
00:52:02.470 --> 00:52:05.770
OK, so again, if I give you
an exponential family, which
00:52:05.770 --> 00:52:09.350
is another way of saying I'll
give you a convex function, b,
00:52:09.350 --> 00:52:12.400
which gives you some
exponential family,
00:52:12.400 --> 00:52:15.160
then if you just
take b prime inverse,
00:52:15.160 --> 00:52:17.770
this gives you the
associated canonical link
00:52:17.770 --> 00:52:21.590
for this canonical
exponential family.
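The recipe "take b prime inverse" can be sketched generically. Because b prime is strictly increasing, it can be inverted by bisection. The bracket [-50, 50], the iteration count, and the function name `canonical_link` are assumptions of this sketch, not anything prescribed in the lecture.

```python
import math

def canonical_link(b_prime, mu, lo=-50.0, hi=50.0, iters=200):
    """Invert a strictly increasing b' by bisection: find theta with b'(theta) = mu.

    Assumes the solution lies inside the bracket [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if b_prime(mid) < mu:
            lo = mid   # solution is to the right of mid
        else:
            hi = mid   # solution is at or to the left of mid
    return 0.5 * (lo + hi)

# Poisson: b(theta) = e^theta, so b'(theta) = e^theta and the link should be log
mu = 3.5
theta = canonical_link(math.exp, mu)
print(theta, math.log(mu))
```

For the Poisson family the numerical inverse agrees with log(mu), so the log link from earlier is recovered as the canonical one.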
00:52:21.590 --> 00:52:26.196
So clearly there's
an advantage of doing
00:52:26.196 --> 00:52:28.070
this, which is I don't
have to actually think
00:52:28.070 --> 00:52:30.800
about which one to pick if I
don't want to think about it.
00:52:30.800 --> 00:52:34.220
But there's other
advantages that come to it,
00:52:34.220 --> 00:52:36.170
and we'll see that in
the representations.
00:52:36.170 --> 00:52:38.711
There's, basically, going to be
some nice cancellations that
00:52:38.711 --> 00:52:39.290
show up.
00:52:39.290 --> 00:52:40.665
So before we go
there, let's just
00:52:40.665 --> 00:52:43.790
compute the canonical link for
the Bernoulli distribution.
00:52:43.790 --> 00:52:46.360
So remember, the
Bernoulli distribution
00:52:46.360 --> 00:52:55.510
has a PMF, which is part of the
canonical exponential family.
00:52:55.510 --> 00:53:00.180
So the PMF of the
Bernoulli is f theta of x.
00:53:06.529 --> 00:53:07.820
Let me just write it like this.
00:53:07.820 --> 00:53:12.470
So it's p to the y,
let's say-- one minus p
00:53:12.470 --> 00:53:16.700
to the 1 minus y,
which I will write
00:53:16.700 --> 00:53:28.910
as exponential y log p, plus
1 minus y, log 1 minus p.
00:53:28.910 --> 00:53:30.750
OK, we've done that last time.
00:53:30.750 --> 00:53:32.670
Now, I'm going to
group my terms in y
00:53:32.670 --> 00:53:37.530
to see how y interacts
with this parameter p.
00:53:37.530 --> 00:53:40.710
And what I'm getting is
y, which is times log p
00:53:40.710 --> 00:53:42.540
divided by 1 minus p.
00:53:42.540 --> 00:53:47.040
And then, the only term that
remains is log, 1 minus p.
00:53:50.370 --> 00:53:53.710
Now, I want this to be a
canonical exponential family,
00:53:53.710 --> 00:53:56.880
which means that I just
need to call this guy,
00:53:56.880 --> 00:53:58.710
so it is part of the
exponential family.
00:53:58.710 --> 00:53:59.460
You can read that.
00:53:59.460 --> 00:54:04.520
If I want it to be canonical,
this guy must be theta itself.
00:54:04.520 --> 00:54:11.180
So I have that theta is
equal to log of p over 1 minus p.
00:54:11.180 --> 00:54:12.800
If I invert this
thing, it tells me
00:54:12.800 --> 00:54:16.880
that p is e to the theta,
divided by 1, plus e
00:54:16.880 --> 00:54:18.434
to the theta.
00:54:18.434 --> 00:54:19.850
It's just inverting
this function.
00:54:23.550 --> 00:54:28.140
In particular, it means
that log, 1 minus p,
00:54:28.140 --> 00:54:31.900
is equal to log, 1
minus this thing.
00:54:31.900 --> 00:54:33.520
So the exponential
thetas go away.
00:54:33.520 --> 00:54:39.350
So in the numerator,
this is what I get.
00:54:39.350 --> 00:54:44.870
That's the log 1 minus this guy,
which is equal to minus log of 1
00:54:44.870 --> 00:54:46.010
plus e to the theta.
00:54:50.790 --> 00:54:52.540
So I'm going a bit too
fast, but these are
00:54:52.540 --> 00:54:56.230
very elementary manipulations--
00:54:56.230 --> 00:54:59.220
maybe, it requires one more
line to convince yourself.
00:54:59.220 --> 00:55:05.940
But just do it in the
comfort of your room.
00:55:05.940 --> 00:55:11.210
And then, what you have is the
exponential of y times theta,
00:55:11.210 --> 00:55:16.850
and then, I have minus
log of 1 plus e to the theta.
00:55:16.850 --> 00:55:20.960
So this is the
representation of the PMF
00:55:20.960 --> 00:55:23.990
of a Bernoulli
distribution,
00:55:23.990 --> 00:55:27.680
as a member of the
canonical exponential family.
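Written out, the board computation just described looks as follows. This is a reconstruction from the spoken steps, assuming y takes values in {0, 1} and p lies strictly between 0 and 1:

```latex
\begin{align*}
f_\theta(y) &= p^{y}(1-p)^{1-y}
  = \exp\bigl(y\log p + (1-y)\log(1-p)\bigr) \\
 &= \exp\Bigl(y\log\frac{p}{1-p} + \log(1-p)\Bigr),
   \qquad \theta = \log\frac{p}{1-p}, \\
p &= \frac{e^{\theta}}{1+e^{\theta}},
   \qquad \log(1-p) = \log\frac{1}{1+e^{\theta}} = -\log\bigl(1+e^{\theta}\bigr), \\
f_\theta(y) &= \exp\bigl(y\theta - \log(1+e^{\theta})\bigr).
\end{align*}
```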
00:55:27.680 --> 00:55:33.530
And it tells me that b of
theta is equal to log of 1
00:55:33.530 --> 00:55:36.510
plus e to the theta.
00:55:36.510 --> 00:55:38.140
That's what I have there.
00:55:38.140 --> 00:55:41.790
From there, I can compute the
expectation, which hopefully,
00:55:41.790 --> 00:55:46.170
I'm going to get p as
the mean and p times 1,
00:55:46.170 --> 00:55:47.759
minus p as the variance.
00:55:47.759 --> 00:55:49.050
Otherwise, that would be weird.
00:55:51.960 --> 00:55:55.840
So let's just do this.
00:55:55.840 --> 00:56:00.950
B prime of theta should
give me the mean.
00:56:00.950 --> 00:56:04.010
And indeed, b prime of
theta is e to the theta,
00:56:04.010 --> 00:56:08.060
divided by 1, plus e to
the theta, which is exactly
00:56:08.060 --> 00:56:09.290
this p that I had there.
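As a numerical sanity check of this claim, here is a minimal sketch that differentiates b(theta) = log(1 + e^theta) by finite differences and compares the result with p and p(1 - p). The helper names (`derivative`, `second_derivative`, `check`) and the step sizes are illustrative choices, not something from the lecture.

```python
import math

def b(theta):
    """Bernoulli cumulant function b(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def check(theta):
    """Return (p, numerical b'(theta), numerical b''(theta))."""
    p = math.exp(theta) / (1 + math.exp(theta))
    return p, derivative(b, theta), second_derivative(b, theta)

for theta in (-2.0, 0.0, 1.5):
    p, mean_from_b, var_from_b = check(theta)
    print(f"theta={theta:+.1f}  p={p:.6f}  b'={mean_from_b:.6f}  "
          f"p(1-p)={p * (1 - p):.6f}  b''={var_from_b:.6f}")
```

The printed columns for p and b' should agree, and likewise p(1 - p) and b'', up to finite-difference error.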
00:56:14.850 --> 00:56:18.350
OK just for fun--
00:56:18.350 --> 00:56:19.200
well, I don't know.
00:56:19.200 --> 00:56:20.520
Maybe, that's not part of it.
00:56:20.520 --> 00:56:22.820
Yeah, let's not compute
the second derivative.
00:56:22.820 --> 00:56:25.800
That's probably going to be on
your homework at some point--
00:56:25.800 --> 00:56:29.440
if not, on the final.
00:56:29.440 --> 00:56:32.890
So b prime now--
00:56:32.890 --> 00:56:34.120
oh, I erased it, of course.
00:56:37.300 --> 00:56:39.380
G, the canonical link
is b prime inverse.
00:56:42.520 --> 00:56:44.770
And I claim that this
is going to give me
00:56:44.770 --> 00:56:48.910
the logit function, log
of mu, over 1 minus mu.
00:56:48.910 --> 00:56:50.480
So let's check that.
00:56:50.480 --> 00:56:54.236
So b prime is this
thing, so now,
00:56:54.236 --> 00:56:55.360
I want to find the inverse.
00:57:02.180 --> 00:57:05.360
Well, I should really call
my inverse a function of p.
00:57:05.360 --> 00:57:06.750
And I've done it before--
00:57:06.750 --> 00:57:08.930
all I have to do is to
solve this equation, which
00:57:08.930 --> 00:57:10.400
I've actually just
done it, that's
00:57:10.400 --> 00:57:11.890
where I'm actually coming from.
00:57:11.890 --> 00:57:14.510
So it's actually telling me
that the solution of this thing
00:57:14.510 --> 00:57:17.941
is equal to log of
p over 1 minus p.
00:57:25.810 --> 00:57:28.540
We just solve this
thing both ways.
00:57:28.540 --> 00:57:38.520
And this is, indeed, logit
of p by definition of logit.
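The inversion can also be checked numerically. The sketch below, with function names chosen only for illustration, applies b' and then the logit and confirms that the round trip returns the original theta.

```python
import math

def b_prime(theta):
    """Mean as a function of the canonical parameter: e^theta / (1 + e^theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

def logit(mu):
    """Candidate canonical link: log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

for theta in (-3.0, -0.5, 0.0, 2.0):
    mu = b_prime(theta)
    # The last two columns should match: logit undoes b_prime.
    print(f"theta={theta:+.2f}  mu={mu:.6f}  logit(mu)={logit(mu):+.6f}")
```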
00:57:38.520 --> 00:57:40.470
So b prime inverse,
this function that
00:57:40.470 --> 00:57:42.440
seemed to come out
of nowhere, is really
00:57:42.440 --> 00:57:45.030
just the inverse of b
prime, which we know
00:57:45.030 --> 00:57:46.200
is the canonical link.
00:57:46.200 --> 00:57:49.200
And canonical is some
sort of ad hoc choices
00:57:49.200 --> 00:57:53.040
that we've made by saying let's
just take the link, such that g
00:57:53.040 --> 00:57:57.819
of mu is giving me the actual
canonical parameter theta.
00:57:57.819 --> 00:57:58.785
Yeah?
00:57:58.785 --> 00:58:00.717
AUDIENCE: [INAUDIBLE]
00:58:02.197 --> 00:58:03.530
PHILIPPE RIGOLLET: You're right.
00:58:08.520 --> 00:58:11.550
Now, of course, I'm going
through all this trouble,
00:58:11.550 --> 00:58:13.470
but you could see
it immediately.
00:58:13.470 --> 00:58:16.550
I know this is
going to be theta.
00:58:16.550 --> 00:58:19.380
We also have prior
knowledge, hopefully,
00:58:19.380 --> 00:58:23.520
that the expectation of
a Bernoulli is p itself.
00:58:23.520 --> 00:58:25.760
So right at this
step, when I say
00:58:25.760 --> 00:58:28.070
that I'm going to take
theta to be this guy,
00:58:28.070 --> 00:58:32.959
I already knew that the
canonical link was the logit--
00:58:32.959 --> 00:58:34.500
because I just said
oh, here's theta.
00:58:34.500 --> 00:58:37.356
And it's just this
function of mu [INAUDIBLE].
00:58:41.100 --> 00:58:43.539
OK, so you can do that
for a bunch of examples,
00:58:43.539 --> 00:58:45.330
and this is what they're
going to give you.
00:58:45.330 --> 00:58:47.820
So the Gaussian
case, b of theta--
00:58:47.820 --> 00:58:49.760
we've actually computed
it, actually, once.
00:58:49.760 --> 00:58:51.290
This is theta squared over 2.
00:58:51.290 --> 00:58:53.130
So the derivative of
this thing is really
00:58:53.130 --> 00:58:56.970
just theta, which means that
g or g inverse is actually
00:58:56.970 --> 00:58:59.280
equal to the identity.
00:58:59.280 --> 00:59:02.760
And again, sanity check--
00:59:02.760 --> 00:59:04.410
when I'm in the
Gaussian case, there's
00:59:04.410 --> 00:59:06.780
nothing general about
general linear models
00:59:06.780 --> 00:59:09.040
if you don't have a link.
00:59:09.040 --> 00:59:12.390
The Poisson case-- you
can actually check.
00:59:12.390 --> 00:59:13.480
Did we do this, actually?
00:59:13.480 --> 00:59:14.350
Yes we did.
00:59:14.350 --> 00:59:16.570
So that's when we
had this e to the theta.
00:59:16.570 --> 00:59:19.960
And so b is e to the theta, which
means that the natural link is
00:59:19.960 --> 00:59:24.790
the inverse, which is log, which
is the inverse of exponential.
00:59:24.790 --> 00:59:29.680
And so that's logarithm
link, which as I said,
00:59:29.680 --> 00:59:31.560
I used the word natural.
00:59:31.560 --> 00:59:33.610
You can also use
the word canonical
00:59:33.610 --> 00:59:35.740
if you want to describe
this function as being
00:59:35.740 --> 00:59:38.620
the right function to map
the positive real line
00:59:38.620 --> 00:59:40.959
to the entire real line.
00:59:40.959 --> 00:59:42.250
The Bernoulli-- we just did it.
00:59:42.250 --> 00:59:46.930
So b-- the cumulant
generating function is log of 1
00:59:46.930 --> 00:59:52.990
plus e to the theta, and the link is
log of mu over 1 minus mu.
00:59:52.990 --> 00:59:57.520
And gamma function--
where you have
00:59:57.520 --> 01:00:00.700
the thing you're going to see is
minus log of minus [INAUDIBLE].
01:00:00.700 --> 01:00:04.030
You see the reciprocal link
is the link that actually
01:00:04.030 --> 01:00:08.045
shows up, so minus 1 over mu.
01:00:08.045 --> 01:00:08.545
That maps.
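The four examples can be collected in one small table and checked mechanically. This is a sketch under stated assumptions: the dictionary layout and the function name are made up for illustration, and the gamma row uses the standard textbook form b(theta) = -log(-theta) for theta < 0, which is an assumption here because the recording is inaudible at that point.

```python
import math

# name -> (b'(theta), canonical link g(mu), sample thetas in the valid range)
FAMILIES = {
    # b(theta) = theta^2 / 2, so b' is the identity
    "gaussian": (lambda t: t, lambda mu: mu, [-1.0, 0.5, 2.0]),
    # b(theta) = e^theta, so b' = exp and the link is log
    "poisson": (lambda t: math.exp(t), lambda mu: math.log(mu), [-1.0, 0.5, 2.0]),
    # b(theta) = log(1 + e^theta), link is the logit
    "bernoulli": (lambda t: 1.0 / (1.0 + math.exp(-t)),
                  lambda mu: math.log(mu / (1.0 - mu)), [-1.0, 0.5, 2.0]),
    # ASSUMPTION: b(theta) = -log(-theta), theta < 0, link is -1/mu
    "gamma": (lambda t: -1.0 / t, lambda mu: -1.0 / mu, [-3.0, -1.0, -0.2]),
}

def max_roundtrip_error(name):
    """Largest |g(b'(theta)) - theta| over the sample thetas for one family."""
    b_prime, g, thetas = FAMILIES[name]
    return max(abs(g(b_prime(t)) - t) for t in thetas)

for name in FAMILIES:
    print(f"{name:10s} max |g(b'(theta)) - theta| = {max_roundtrip_error(name):.2e}")
```

Each error should be at floating-point level, which is just the statement that g is the inverse of b' in every row.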
01:00:35.690 --> 01:00:40.400
So are there any questions
about the canonical links,
01:00:40.400 --> 01:00:42.532
canonical families?
01:00:42.532 --> 01:00:45.020
I use the word canonical a lot.
01:00:45.020 --> 01:00:48.929
But is everything fitting
together right now?
01:00:48.929 --> 01:00:49.970
So we have this function.
01:00:49.970 --> 01:00:53.060
We have canonical exponential
family, by assumption.
01:00:53.060 --> 01:00:54.800
It has a function,
b, which contains
01:00:54.800 --> 01:00:56.552
every information we want.
01:00:56.552 --> 01:00:58.010
At the beginning
of the lecture, we
01:00:58.010 --> 01:00:59.468
established that
it has information
01:00:59.468 --> 01:01:01.310
about the mean in
the first derivative,
01:01:01.310 --> 01:01:03.143
about the variance in
the second derivative.
01:01:03.143 --> 01:01:05.210
And it's also giving
us a canonical link.
01:01:05.210 --> 01:01:08.035
So just cherish this b
once you've found it,
01:01:08.035 --> 01:01:09.410
because it's
everything you need.
01:01:09.410 --> 01:01:09.909
Yeah?
01:01:09.909 --> 01:01:11.750
AUDIENCE: [INAUDIBLE]
01:01:15.962 --> 01:01:19.342
PHILIPPE RIGOLLET: I don't
know, a political preference?
01:01:24.710 --> 01:01:26.730
I don't know, honestly.
01:01:26.730 --> 01:01:29.570
If I were a serious
practitioner,
01:01:29.570 --> 01:01:31.700
I probably would have a
better answer for you.
01:01:31.700 --> 01:01:32.870
At this point, I just don't.
01:01:32.870 --> 01:01:34.244
I think it's a
matter of practice
01:01:34.244 --> 01:01:36.860
and actual preferences.
01:01:36.860 --> 01:01:38.426
You can also try both.
01:01:38.426 --> 01:01:39.800
We didn't mention
it, but there's
01:01:39.800 --> 01:01:41.360
this idea of
cross-validation-- well,
01:01:41.360 --> 01:01:43.610
we mentioned it without
going too much into detail.
01:01:43.610 --> 01:01:46.460
But you could try both
and see which one performs
01:01:46.460 --> 01:01:48.617
best on a yet unseen data set.
01:01:48.617 --> 01:01:51.200
In terms of prediction, just say
I prefer this one of the two,
01:01:51.200 --> 01:01:53.450
because this actually comes
as part of your modeling
01:01:53.450 --> 01:01:56.090
assumption, right?
01:01:56.090 --> 01:01:59.630
Not only did you decide
to model the image of mu
01:01:59.630 --> 01:02:03.057
through the link function as
a linear model, but really
01:02:03.057 --> 01:02:03.890
what you're saying--
01:02:03.890 --> 01:02:05.750
your model is saying
well, you have
01:02:05.750 --> 01:02:07.860
two pieces of [INAUDIBLE],
the distribution of y.
01:02:07.860 --> 01:02:10.340
But you also have
the fact that mu
01:02:10.340 --> 01:02:14.870
is modeled as g inverse
of x transpose beta.
01:02:14.870 --> 01:02:17.120
And for different g's, this
is just different modeling
01:02:17.120 --> 01:02:18.380
assumptions, right?
01:02:18.380 --> 01:02:25.930
So why should this be linear--
01:02:25.930 --> 01:02:26.610
I don't know.
01:02:29.470 --> 01:02:32.740
My authority as a
person who has not
01:02:32.740 --> 01:02:34.780
examined the
[INAUDIBLE] data sets
01:02:34.780 --> 01:02:38.050
for both things would be that
the changes are fairly minor.
01:02:42.270 --> 01:02:45.420
OK, so this was all
for one observation.
01:02:45.420 --> 01:02:49.350
We just, basically,
did probability.
01:02:49.350 --> 01:02:52.620
We described some density, some
properties of the densities,
01:02:52.620 --> 01:02:53.940
how to compute expectations.
01:02:53.940 --> 01:02:55.314
That was really
just probability.
01:02:55.314 --> 01:02:57.240
There was no data
involved at any point.
01:02:57.240 --> 01:03:00.330
We did a bit of modeling, but
it was all for one observation.
01:03:00.330 --> 01:03:01.710
What we're going
to try to do now
01:03:01.710 --> 01:03:06.360
is given the reverse
engineering to probability
01:03:06.360 --> 01:03:08.310
that is statistics,
given data, what
01:03:08.310 --> 01:03:10.780
can I infer about my model?
01:03:10.780 --> 01:03:12.370
Now remember, there's
three parameters
01:03:12.370 --> 01:03:15.040
that are floating
around in this model.
01:03:15.040 --> 01:03:18.190
There is one that was theta.
01:03:18.190 --> 01:03:21.689
There is one that was mu, and
there is one that is beta.
01:03:21.689 --> 01:03:23.230
OK, so those are
the three parameters
01:03:23.230 --> 01:03:25.110
that are floating around.
01:03:25.110 --> 01:03:32.550
What we said is that the
expectation of y, given x,
01:03:32.550 --> 01:03:34.980
is mu of x.
01:03:34.980 --> 01:03:37.950
So if I estimate mu, I know the
conditional expectation of y,
01:03:37.950 --> 01:03:44.960
given x, which definitely
gives me theta of x.
01:03:44.960 --> 01:03:46.830
How do I go from mu
of x to theta of x?
01:03:55.080 --> 01:03:58.010
The inverse of what--
01:03:58.010 --> 01:03:59.890
of the arrow?
01:03:59.890 --> 01:04:07.290
Yeah, sure, but how do I go
from this guy to this guy?
01:04:07.290 --> 01:04:08.860
So theta as a function of mu is?
01:04:12.556 --> 01:04:13.792
AUDIENCE: [INAUDIBLE]
01:04:13.792 --> 01:04:15.250
PHILIPPE RIGOLLET:
Yeah, so we just
01:04:15.250 --> 01:04:18.760
computed that mu was
b prime of theta.
01:04:18.760 --> 01:04:23.260
So it means that theta is
just b prime inverse of mu.
01:04:23.260 --> 01:04:24.910
So those two things
are the same as far
01:04:24.910 --> 01:04:27.580
as we're concerned, because we
know that b prime is strictly
01:04:27.580 --> 01:04:29.020
increasing, so it's invertible.
01:04:29.020 --> 01:04:31.560
So it's just a matter
of re-parametrization,
01:04:31.560 --> 01:04:34.420
and we just can switch from one
to the other whenever we want.
01:04:34.420 --> 01:04:36.754
But why we go through
mu, because so far
01:04:36.754 --> 01:04:38.170
for the entire
semester I told you
01:04:38.170 --> 01:04:39.150
there was one
parameter that's theta.
01:04:39.150 --> 01:04:41.420
It does not have to be the
mean, and that's the parameter
01:04:41.420 --> 01:04:42.130
that we care about.
01:04:42.130 --> 01:04:43.700
It's the one on which we
want to do inference.
01:04:43.700 --> 01:04:45.580
That's the one for which we're
going to compute the Fisher
01:04:45.580 --> 01:04:46.360
information.
01:04:46.360 --> 01:04:49.572
This was the parameter that
was our object of worship.
01:04:49.572 --> 01:04:51.280
And now, I'm saying
oh, I'm going to have
01:04:51.280 --> 01:04:53.200
mu that's coming around.
01:04:53.200 --> 01:04:55.270
And why we have mu,
because this is the mu
01:04:55.270 --> 01:04:58.390
that we use to go to beta.
01:04:58.390 --> 01:05:06.360
So I can go freely from theta
to mu using b prime or b
01:05:06.360 --> 01:05:07.600
prime inverse.
01:05:07.600 --> 01:05:11.080
And now, I can go
from mu to beta,
01:05:11.080 --> 01:05:19.120
because I have that g of mu
of x is beta transpose x.
01:05:19.120 --> 01:05:21.130
So in the end,
now, this is going
01:05:21.130 --> 01:05:22.360
to be my object of worship.
01:05:22.360 --> 01:05:24.318
This is going to be the
parameter that matters.
01:05:24.318 --> 01:05:27.910
Because once I set beta,
I set everything else
01:05:27.910 --> 01:05:30.290
through this chain.
01:05:30.290 --> 01:05:33.010
So the question is if I
start stacking up this pile
01:05:33.010 --> 01:05:36.260
of parameters-- so I
start with my beta,
01:05:36.260 --> 01:05:38.520
which in turn gives me
a mu, which in turn,
01:05:38.520 --> 01:05:39.580
gives me a theta--
01:05:39.580 --> 01:05:43.720
can I just have a
long, streamlined--
01:05:43.720 --> 01:05:45.640
what is the outcome
when I actually
01:05:45.640 --> 01:05:48.016
start writing my likelihood,
not as a function of theta,
01:05:48.016 --> 01:05:50.140
not as a function of mu,
but as a function of beta,
01:05:50.140 --> 01:05:52.720
which is the one at
the end of the chain?
01:05:52.720 --> 01:05:55.540
And hopefully, things are
going to happen nicely,
01:05:55.540 --> 01:05:56.292
and they might not.
01:05:56.292 --> 01:05:56.792
Yeah?
01:05:56.792 --> 01:05:58.702
AUDIENCE: [INAUDIBLE]
01:06:02.076 --> 01:06:03.680
PHILIPPE RIGOLLET: Is G--
01:06:03.680 --> 01:06:04.800
that's my link.
01:06:04.800 --> 01:06:06.710
G of mu of x--
01:06:06.710 --> 01:06:09.320
now, mu is a function of x, because it's conditional on x.
because its conditional on x.
01:06:12.200 --> 01:06:17.000
So this is really
theta of x, mu of x,
01:06:17.000 --> 01:06:21.100
but b is not a function of x,
because it's just something
01:06:21.100 --> 01:06:22.965
that tells me what the
function of x is.
01:06:22.965 --> 01:06:24.865
AUDIENCE: [INAUDIBLE]
01:06:26.074 --> 01:06:28.240
PHILIPPE RIGOLLET: Mu is
the conditional expectation
01:06:28.240 --> 01:06:29.770
of y, given x.
01:06:29.770 --> 01:06:33.010
It has, actually, a fancy name
in the statistics literature.
01:06:33.010 --> 01:06:36.989
It's called-- anybody knows
the name of the function, mu
01:06:36.989 --> 01:06:39.280
of x, which is a conditional
expectation of y, given x?
01:06:42.116 --> 01:06:43.960
AUDIENCE: [INAUDIBLE]
01:06:43.960 --> 01:06:46.120
PHILIPPE RIGOLLET: That's
the regression function.
01:06:46.120 --> 01:06:47.230
That's the actual definition.
01:06:47.230 --> 01:06:48.970
If I tell you what is the
definition of the regression
01:06:48.970 --> 01:06:51.011
function, that's just the
conditional expectation
01:06:51.011 --> 01:06:52.970
of y, given x.
01:06:52.970 --> 01:06:58.720
And I could look at any property
of the conditional distribution
01:06:58.720 --> 01:07:00.020
of y given x.
01:07:00.020 --> 01:07:02.639
I could look at the
conditional 95th percentile.
01:07:02.639 --> 01:07:04.180
I can look at the
conditional median.
01:07:04.180 --> 01:07:06.450
I can look at the conditional
[INAUDIBLE] range.
01:07:06.450 --> 01:07:08.470
I can look at the
conditional variance.
01:07:08.470 --> 01:07:12.300
But I decide to look at the
conditional expectation, which
01:07:12.300 --> 01:07:15.429
is called the
regression function.
01:07:18.363 --> 01:07:19.341
Yes?
01:07:19.341 --> 01:07:21.297
AUDIENCE: [INAUDIBLE]
01:07:24.231 --> 01:07:26.290
PHILIPPE RIGOLLET: Oh,
there's no transpose here.
01:07:26.290 --> 01:07:28.700
Actually, only Victor-Emmanuel
used this prime for transpose,
01:07:28.700 --> 01:07:30.710
and I found it confusing
with the derivatives.
01:07:30.710 --> 01:07:33.306
So primes here is
only a derivative.
01:07:33.306 --> 01:07:34.623
AUDIENCE: [INAUDIBLE]
01:07:35.122 --> 01:07:38.640
PHILIPPE RIGOLLET: Oh, yeah,
sorry, beta transpose x.
01:07:38.640 --> 01:07:40.350
So you said what?
01:07:40.350 --> 01:07:43.245
I said that g of mu of
x is beta transpose x?
01:07:43.245 --> 01:07:45.145
AUDIENCE: [INAUDIBLE]
01:07:48.035 --> 01:07:49.910
PHILIPPE RIGOLLET: Isn't
that the same thing?
01:07:52.510 --> 01:07:53.970
X is a vector here, right?
01:07:53.970 --> 01:07:54.930
AUDIENCE: Yeah.
01:07:54.930 --> 01:07:56.555
PHILIPPE RIGOLLET:
So x transpose beta,
01:07:56.555 --> 01:08:00.348
and beta transpose x
are the same thing.
01:08:00.348 --> 01:08:02.280
AUDIENCE: [INAUDIBLE]
01:08:03.979 --> 01:08:05.770
PHILIPPE RIGOLLET: So
beta looks like this.
01:08:05.770 --> 01:08:08.706
X looks like this.
01:08:08.706 --> 01:08:12.420
It's just a simple number.
01:08:12.420 --> 01:08:13.386
Yeah, you're right.
01:08:13.386 --> 01:08:15.010
I'm going to start
to look at matrices.
01:08:15.010 --> 01:08:18.189
I'm going to have to be slightly
more careful when I do this.
01:08:18.189 --> 01:08:20.740
OK so let's do the
reverse engineering.
01:08:20.740 --> 01:08:22.199
I'm giving you data.
01:08:22.199 --> 01:08:23.740
From this data,
hopefully, you should
01:08:23.740 --> 01:08:26.979
be able to get what the
conditional-- if I give you
01:08:26.979 --> 01:08:29.630
an infinite amount of data,
you would know exactly,
01:08:29.630 --> 01:08:33.819
of pairs xy, what the
conditional distribution of y
01:08:33.819 --> 01:08:36.130
given x is.
01:08:36.130 --> 01:08:37.770
And in particular,
you would know
01:08:37.770 --> 01:08:40.560
what the conditional
expectation of y given x
01:08:40.560 --> 01:08:42.359
is, which means that
you would know mu,
01:08:42.359 --> 01:08:44.192
which means that you
would know theta, which
01:08:44.192 --> 01:08:45.920
means that you would know beta.
01:08:45.920 --> 01:08:48.600
Now, when I have a finite
number of observations,
01:08:48.600 --> 01:08:50.910
I'm going to try to
estimate mu of x.
01:08:50.910 --> 01:08:53.250
But really, I'm going to
go the other way around.
01:08:53.250 --> 01:08:56.279
Because the fact that I assume,
specifically, that mu of x
01:08:56.279 --> 01:09:00.510
is of the form g of mu of x
is x transpose beta, then that
01:09:00.510 --> 01:09:02.850
means that I only have
to estimate beta, which
01:09:02.850 --> 01:09:06.432
is a much simpler object than
the entire regression function.
01:09:06.432 --> 01:09:07.890
So that's what I'm
going to go for.
01:09:07.890 --> 01:09:10.330
I'm going to try to represent
the likelihood, the log
01:09:10.330 --> 01:09:12.890
likelihood, of my data as
a function, not of theta,
01:09:12.890 --> 01:09:15.390
not of mu, but of beta--
01:09:15.390 --> 01:09:18.120
and then, maximize that guy.
01:09:18.120 --> 01:09:21.870
So now, rather than thinking
of just one observation,
01:09:21.870 --> 01:09:23.940
I'm going to have a
bunch of observations.
01:09:27.100 --> 01:09:29.069
So this might actually
look a little confusing,
01:09:29.069 --> 01:09:32.189
but let's just make sure
that we understand each other
01:09:32.189 --> 01:09:33.700
before we go any further.
01:09:33.700 --> 01:09:38.510
So I'm going to
have observations,
01:09:38.510 --> 01:09:43.359
x1, y1, all the
way to xn, yn, just
01:09:43.359 --> 01:09:45.310
like in a natural
regression problem,
01:09:45.310 --> 01:09:49.180
except that here my y's
might be 0 one valued.
01:09:49.180 --> 01:09:50.649
They might be positive valued.
01:09:50.649 --> 01:09:51.732
They might be exponential.
01:09:51.732 --> 01:09:54.600
They might be anything in the
canonical exponential family.
01:09:57.840 --> 01:09:59.950
OK so I have this
thing, and now,
01:09:59.950 --> 01:10:01.900
what I have is that my
observations are x1,
01:10:01.900 --> 01:10:03.310
y1, xn, yn.
01:10:03.310 --> 01:10:06.460
And what I want
is that I'm going
01:10:06.460 --> 01:10:11.640
to assume that the conditional
expectation of yi, given--
01:10:14.980 --> 01:10:18.710
the conditional distribution
of yi, given xi,
01:10:18.710 --> 01:10:20.390
is something that has density.
01:10:30.070 --> 01:10:31.473
Did I put an i on y-- yeah.
01:10:42.820 --> 01:10:45.920
I'm not going to deal with
the phi and the c now.
01:10:45.920 --> 01:10:48.610
And why do I have
theta i and not theta
01:10:48.610 --> 01:11:01.350
is because theta i is
really a function of xi.
01:11:01.350 --> 01:11:05.270
So it's really theta i of xi.
01:11:05.270 --> 01:11:07.240
But what do I know
about theta i of xi,
01:11:07.240 --> 01:11:11.890
it's actually equal to b--
01:11:11.890 --> 01:11:13.920
I did this error twice--
01:11:13.920 --> 01:11:16.450
b prime inverse of mu of xi.
01:11:30.620 --> 01:11:34.160
And I'm going to assume that
this is of the form beta
01:11:34.160 --> 01:11:36.190
transpose xi.
01:11:36.190 --> 01:11:37.810
And this is why I have theta i--
01:11:37.810 --> 01:11:40.414
is because this theta
i is a function of xi,
01:11:40.414 --> 01:11:42.830
and I'm going to assume a very
simple form for this thing.
01:11:46.030 --> 01:11:48.747
Sorry, sorry, sorry, sorry--
01:11:48.747 --> 01:11:50.080
I should not write it like this.
01:11:50.080 --> 01:11:51.980
This is only when I
have the canonical link.
01:11:51.980 --> 01:11:57.310
So this is actually equal
to b prime inverse of g,
01:11:57.310 --> 01:11:59.650
of xi transpose beta.
01:12:05.010 --> 01:12:07.754
Sorry, g inverse--
those two things
01:12:07.754 --> 01:12:09.170
are actually
canceling each other.
01:12:13.760 --> 01:12:17.735
So as before, I'm going to
stack everything into some--
01:12:17.735 --> 01:12:20.360
well, actually, I'm not going to
stack anything for the moment.
01:12:20.360 --> 01:12:22.151
I'm just going to give
you a peek at what's
01:12:22.151 --> 01:12:28.010
happening next week, rather
than just manipulating the data.
01:12:28.010 --> 01:12:33.810
So here is how we're going
to proceed at this point.
01:12:33.810 --> 01:12:36.540
Well now, I want to write
my likelihood function,
01:12:36.540 --> 01:12:39.270
not as a function of theta,
but as a function of beta,
01:12:39.270 --> 01:12:44.270
because that's the parameter
I'm actually trying to maximize.
01:12:44.270 --> 01:12:47.050
So if I have a link--
01:12:47.050 --> 01:12:50.455
so this thing that matters
here, I'm going to call h.
01:12:53.600 --> 01:12:58.190
By definition, this is going
to be h of xi transpose beta.
01:12:58.190 --> 01:13:00.080
Helena, you have a question?
01:13:00.080 --> 01:13:02.069
AUDIENCE: Uh, no [INAUDIBLE]
01:13:02.069 --> 01:13:04.110
PHILIPPE RIGOLLET: So this
is just all the things
01:13:04.110 --> 01:13:04.930
that we know.
01:13:04.930 --> 01:13:09.150
Theta is just the, by
definition of the fact that mu
01:13:09.150 --> 01:13:11.505
is b prime of theta, the
mean is b prime of theta--
01:13:11.505 --> 01:13:14.250
it means that theta is
b prime inverse of mu.
01:13:14.250 --> 01:13:19.190
And then, mu is modeled from
the systematic component.
01:13:19.190 --> 01:13:21.940
G of mu is xi transpose
beta, so this is
01:13:21.940 --> 01:13:23.590
g inverse of xi transpose beta.
01:13:23.590 --> 01:13:27.810
So I want to have b prime
inverse of g inverse.
01:13:27.810 --> 01:13:30.030
This function is a
bit annoying to say,
01:13:30.030 --> 01:13:32.750
so I'm just going to call it h.
01:13:32.750 --> 01:13:34.837
And when I do the
composition of two inverses,
01:13:34.837 --> 01:13:36.920
I get the inverse of the composition
of those two things
01:13:36.920 --> 01:13:38.280
in the reverse order--
01:13:38.280 --> 01:13:42.140
so h is really the inverse
of g, composed with b
01:13:42.140 --> 01:13:46.677
prime, g of b prime inverse.
01:13:46.677 --> 01:13:48.260
And now, if I have
the canonical link,
01:13:48.260 --> 01:13:51.200
since I know that g
is b prime inverse,
01:13:51.200 --> 01:13:54.180
this is really
just the identity.
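To make h concrete, here is a small sketch for the Bernoulli family. It builds h as b' inverse composed with g inverse, once with the canonical logit link and once with the complementary log-log link, which is brought in here only as an example of a non-canonical choice and is not discussed in the lecture. All function names are illustrative.

```python
import math

def b_prime_inverse(mu):
    """Bernoulli: theta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def logit_inverse(eta):
    """Inverse of the canonical (logit) link."""
    return 1.0 / (1.0 + math.exp(-eta))

def cloglog_inverse(eta):
    """Inverse of the complementary log-log link g(mu) = log(-log(1 - mu))."""
    return 1.0 - math.exp(-math.exp(eta))

def make_h(g_inverse):
    """h = (b')^{-1} composed with g^{-1}, so that theta_i = h(x_i^T beta)."""
    return lambda eta: b_prime_inverse(g_inverse(eta))

h_canonical = make_h(logit_inverse)
h_cloglog = make_h(cloglog_inverse)

for eta in (-1.0, 0.0, 1.0):
    # h_canonical returns eta itself; h_cloglog does not.
    print(f"eta={eta:+.1f}  h_canonical={h_canonical(eta):+.6f}  "
          f"h_cloglog={h_cloglog(eta):+.6f}")
```

With the canonical link the middle column reproduces eta, so theta_i is simply x_i transpose beta; with another link, h stays in the likelihood.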
01:13:54.180 --> 01:13:58.109
As you can imagine,
this entire thing,
01:13:58.109 --> 01:13:59.650
which is actually
quite complicated--
01:13:59.650 --> 01:14:01.750
would just say oh, this thing,
actually, does not show up
01:14:01.750 --> 01:14:03.041
when I have the canonical link.
01:14:03.041 --> 01:14:06.370
I really just have that theta
can be replaced by xi transpose beta.
01:14:06.370 --> 01:14:09.280
So think about going
back to this guy here.
01:14:09.280 --> 01:14:15.160
Now, theta becomes
only xi transpose beta.
01:14:15.160 --> 01:14:18.425
That's going to be much
more simple to optimize,
01:14:18.425 --> 01:14:20.550
because remember, when I'm
going to log likelihood,
01:14:20.550 --> 01:14:21.841
this thing is going to go away.
01:14:21.841 --> 01:14:23.020
I'm going to sum those guys.
01:14:23.020 --> 01:14:24.310
And so what I'm going to
have is something which
01:14:24.310 --> 01:14:26.140
is essentially linear in beta.
01:14:26.140 --> 01:14:28.340
And then, I'm going
to have this minus b,
01:14:28.340 --> 01:14:31.760
which is just minus the sum
of convex functions of beta.
01:14:31.760 --> 01:14:34.220
And so I'm going to have to
bring in the tools of convex
01:14:34.220 --> 01:14:34.860
optimization.
01:14:34.860 --> 01:14:37.566
Now, it's not just going to be
take the gradient, set it to 0.
01:14:37.566 --> 01:14:39.440
It's going to be more
complicated to do that.
01:14:39.440 --> 01:14:42.320
I'm going to have to do that
in an iterative fashion.
01:14:42.320 --> 01:14:43.800
And so that's what
I'm telling you,
01:14:43.800 --> 01:14:46.400
when you look at your
log likelihood for all
01:14:46.400 --> 01:14:47.330
those functions.
01:14:47.330 --> 01:14:50.062
You sum, the exponential goes
away because you had the log,
01:14:50.062 --> 01:14:51.770
and then, you have
all these things here.
01:14:51.770 --> 01:14:52.660
I kept the b.
01:14:52.660 --> 01:14:53.990
I kept the h.
01:14:53.990 --> 01:14:56.690
But if h is the identity,
this is the linear function,
01:14:56.690 --> 01:14:59.210
the linear part, yi
times xi transpose
01:14:59.210 --> 01:15:03.776
beta, minus b of my theta, which
is now only xi transpose beta.
01:15:03.776 --> 01:15:05.900
And that's the function I
want to maximize in beta.
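For the Bernoulli family with the canonical link, that objective can be written in a few lines. This is a sketch, assuming the dispersion is 1 and dropping the term that does not depend on beta; the toy data are made up purely for illustration.

```python
import math

def b(theta):
    """Bernoulli cumulant function log(1 + e^theta), computed stably."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def log_likelihood(beta, xs, ys):
    """Sum over i of y_i * x_i^T beta - b(x_i^T beta), canonical link assumed."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(bj * xj for bj, xj in zip(beta, x))  # x_i^T beta
        total += y * eta - b(eta)
    return total

# Toy data (hypothetical); the first coordinate of each x is the intercept.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 0, 1]

# At beta = 0 every term is -log(2), so the value is -5 * log(2).
print(log_likelihood((0.0, 0.0), xs, ys))
print(log_likelihood((-0.5, 1.0), xs, ys))
```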
01:15:10.370 --> 01:15:11.390
It's a concave function.
01:15:11.390 --> 01:15:15.130
When I know what b is, I have
an explicit formula for this,
01:15:15.130 --> 01:15:18.230
and I want to just bring
in some optimization.
01:15:18.230 --> 01:15:19.682
And that's what
we're going to do,
01:15:19.682 --> 01:15:21.890
and we're going to see three
different methods, which
01:15:21.890 --> 01:15:24.110
are really, basically,
the same method.
01:15:24.110 --> 01:15:28.760
It's just an adaptation
or specialization
01:15:28.760 --> 01:15:31.550
of the so-called Newton-Raphson
method, which is essentially
01:15:31.550 --> 01:15:34.735
telling you do iterative
local quadratic approximations
01:15:34.735 --> 01:15:36.360
through your function--
so second order
01:15:36.360 --> 01:15:38.480
[INAUDIBLE] expansion,
minimize this guy,
01:15:38.480 --> 01:15:41.060
and then do it again
from where you were.
01:15:41.060 --> 01:15:43.460
And we'll see that
this can be, actually,
01:15:43.460 --> 01:15:47.210
implemented using what's called
iteratively re-weighted least
01:15:47.210 --> 01:15:49.640
squares, which means
that every step--
01:15:49.640 --> 01:15:51.200
since it's just
a quadratic, it's
01:15:51.200 --> 01:15:53.090
going to be just
squares in there--
01:15:53.090 --> 01:15:56.190
can actually be solved
by using a weighted least
01:15:56.190 --> 01:15:59.420
squares version of the problem.
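As a preview of what such an iteration can look like, here is a minimal Newton-Raphson sketch for logistic regression with an intercept and one slope, using the canonical link. It is an illustration under assumptions, not the algorithm as it will be presented in the course: the function names, the toy data, and the stopping rule are all hypothetical. The weights `w` are the b''(theta_i) terms that a re-weighted least squares formulation would use.

```python
import math

def sigmoid(t):
    """b'(theta) for the Bernoulli family."""
    return 1.0 / (1.0 + math.exp(-t))

def newton_logistic(xs, ys, n_iter=25, tol=1e-10):
    """Maximize sum_i y_i*(a + c*x_i) - log(1 + exp(a + c*x_i)) over (a, c)."""
    a, c = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient g and negative Hessian H of the log likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = sigmoid(a + c * x)   # fitted mean b'(theta_i)
            w = mu * (1.0 - mu)       # weight b''(theta_i)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H * step = g by Cramer's rule.
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        dc = (h00 * g1 - h01 * g0) / det
        a, c = a + da, c + dc
        if abs(da) + abs(dc) < tol:
            break
    return a, c

# Toy data (hypothetical), not linearly separable so the maximizer exists.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]

a_hat, c_hat = newton_logistic(xs, ys)
print(f"intercept={a_hat:.4f}  slope={c_hat:.4f}")
```

Each pass builds a local quadratic approximation from the gradient and the weights, solves it exactly, and restarts from the new point, which is the loop described above.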
01:15:59.420 --> 01:16:02.270
So I'm going to
stop here for today.
01:16:02.270 --> 01:16:05.930
So we'll continue and probably
not finish this chapter,
01:16:05.930 --> 01:16:07.440
but finish next week.
01:16:07.440 --> 01:16:10.670
And then, I think
there's only one lecture.
01:16:10.670 --> 01:16:13.310
Actually, for the last lecture,
what do you guys want to do?
01:16:16.320 --> 01:16:18.460
Do you want to have
doughnuts and cider?
01:16:18.460 --> 01:16:25.620
Do you want to just have
some more outlooking lecture
01:16:25.620 --> 01:16:31.390
on what's happening
post 1975 in statistics?
01:16:31.390 --> 01:16:36.130
Do you want to have a
review for the final exam--
01:16:36.130 --> 01:16:38.970
pragmatic people.
01:16:38.970 --> 01:16:43.300
AUDIENCE: [INAUDIBLE]
interesting, advanced topics.
01:16:43.300 --> 01:16:46.100
PHILIPPE RIGOLLET: You want
to do interesting, advanced--
01:16:46.100 --> 01:16:48.200
for the last lecture?
01:16:48.200 --> 01:16:50.470
AUDIENCE: Something that
we haven't thought of yet.
01:16:50.470 --> 01:16:53.920
PHILIPPE RIGOLLET: Yeah, that's,
basically, what I'm asking,
01:16:53.920 --> 01:16:55.420
right-- interesting
advanced topics,
01:16:55.420 --> 01:17:00.694
versus ask me any
question you want.
01:17:00.694 --> 01:17:03.110
Those questions can be about
interesting, advanced topics,
01:17:03.110 --> 01:17:03.850
though.
01:17:03.850 --> 01:17:06.020
Like, what are interesting,
advanced topics?
01:17:06.020 --> 01:17:06.547
I'm sorry?
01:17:06.547 --> 01:17:08.630
AUDIENCE: Interesting with
doughnuts-- is that OK?
01:17:08.630 --> 01:17:10.963
PHILIPPE RIGOLLET: Yeah, we
can always do the doughnuts.
01:17:10.963 --> 01:17:11.838
[LAUGHTER]
01:17:11.838 --> 01:17:14.792
AUDIENCE: As long as
there are doughnuts.
01:17:14.792 --> 01:17:16.750
PHILIPPE RIGOLLET: All
right, so we'll do that.
01:17:16.750 --> 01:17:19.500
So you guys have a good weekend.