The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We're talking about goodness-of-fit tests. A goodness-of-fit test asks: does my data come from a particular distribution? And why would we want to know this? Well, maybe we're interested, for example, in whether the zodiac signs of Fortune 500 CEOs are uniformly distributed. Or maybe we have slightly deeper endeavors, such as checking whether we can apply the t-test, by testing normality of our sample.

We saw the main standard test for this. It's called the Kolmogorov-Smirnov test, and people use it quite a bit; it's probably one of the most used tests out there. There are other versions of it that I mentioned in passing: the Cramer-von Mises test and the Anderson-Darling test. Now, how would you pick one of these tests? Well, they're always going to have their advantages and disadvantages. Kolmogorov-Smirnov is definitely the most widely used, I guess because it's a natural notion of distance between functions: you look at how far apart the two functions can be at each point, and you take the largest such gap. Cramer-von Mises involves an L2 distance, so if you're not used to Hilbert spaces or Euclidean notions, it's a little more complicated. And Anderson-Darling is definitely even more complicated. Now, each of these tests is going to be more powerful against different alternatives.
So unless you can really guess which alternative you're expecting to see — which you probably can't, because you're typically in a case where you want to declare H0 to be the correct one — it's really a matter of tossing a coin. Maybe you can run all three of them and just sleep better at night because all three of them have failed to reject, for example.

As I mentioned, one of the primary goals of goodness-of-fit testing is to check whether we can apply Student's test — whether the Student distribution is actually valid for our statistic. For that, we need normally distributed data. Now, as I've said several times, "normally distributed" is not a specific distribution; it's a family of distributions indexed by means and variances. And the way I would want to test whether my data is normally distributed is to look at the most natural normal, or Gaussian, distribution that my data could follow: the Gaussian distribution with the same mean as my data and the same empirical variance as my data. So I'm given some points X1, ..., Xn, and I'm asking: are those Gaussian? That's equivalent to asking: are they N(mu, sigma^2) for some mu and sigma^2? And of course, the natural choice is to take mu equal to mu hat, which is Xn bar, and sigma^2 equal to sigma hat squared, which is what we wrote Sn = (1/n) sum_{i=1}^n (Xi - Xn bar)^2.

So this is definitely the natural thing to test. And maybe you could actually just close your eyes and stuff that into a Kolmogorov-Smirnov test.
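To make that "close your eyes" plug-in statistic concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the helper name plugin_ks_statistic is made up). It uses the fact that the supremum over t of |F_n(t) - Phi(t)| is attained at the jump points of the empirical CDF, i.e., at the data points:

```python
import numpy as np
from scipy.stats import norm

def plugin_ks_statistic(x):
    """Sup-distance between the empirical CDF and the Gaussian CDF with
    plugged-in estimates mu_hat = X_bar and sigma_hat^2 = S_n (the 1/n one).
    The sup over t is attained at the jump points of F_n, i.e. at the data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    cdf = norm.cdf(np.sort(x), loc=x.mean(), scale=x.std())  # np.std uses ddof=0, matching S_n
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - cdf), np.max(cdf - (grid - 1 / n)))

rng = np.random.default_rng(0)
print(plugin_ks_statistic(rng.normal(2.0, 3.0, size=100)))
```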
But there are a few things here that don't work. The first one is that Donsker's theorem does not apply anymore. Donsker's theorem was the one that told us that, properly normalized, this thing converges to the supremum of a Brownian bridge — and that is no longer true. So that's one problem. But there's actually an even bigger point: this statistic, as we will check in a second, is still pivotal. It does not have a distribution that depends on the unknown parameters, which is sort of nice, at least under the null. However, its distribution is not the same as the one we had with fixed mu and sigma. The fact that the plugged-in parameters come from random variables is actually distorting the distribution itself, and in particular the quantiles are going to be distorted; we hinted at that last time.

So one other thing I need to tell you — and yes, that's where there's a word missing on the slide — is that we compute the quantiles for this test statistic. What I need to promise you is that these quantiles do not depend on any unknown parameter. And that's not clear a priori, right? I want to test whether my data has some Gaussian distribution. Under the null, all I know is that my Xi's are Gaussian with some mean mu and some variance sigma^2, which I don't know. So it could be the case that when I try to understand the distribution of this quantity under the null, it depends on mu and sigma, which I don't know. We need to check that this is not the case. And what's actually our redemption here is the supremum: the supremum is basically going to allow us to "sup out" mu and sigma^2.

So let's check that. What I'm interested in is this quantity: the supremum over t in R of the difference between F_n(t) and what I write as Phi_{mu hat, sigma hat^2}(t), where Phi_{mu hat, sigma hat^2} is the CDF of a Gaussian with mean mu hat and variance sigma hat squared.
And so in particular, Phi_{mu hat, sigma hat^2}(t) is the probability that some X is less than t, where X follows N(mu hat, sigma hat^2). By the usual translation-and-scaling trick that turns a Gaussian into a standard Gaussian, this means there exists some Z — standard Gaussian this time, so mean 0 and variance 1 — such that X = sigma hat Z + mu hat. Agreed? That's basically saying that X is Gaussian with mean mu hat and variance sigma hat squared; I'm not going to say the hats every single time. Actually, maybe I shouldn't use X here, because X is going to be my actual data — so let me write Y.

So now what is this guy? It implies that Phi_{mu hat, sigma hat^2}(t) is the probability that sigma hat Z + mu hat is less than t, which equals the probability that Z is less than (t - mu hat)/sigma hat. But since Z is standard normal, this is just the cumulative distribution function of a standard Gaussian, evaluated not at t but at (t - mu hat)/sigma hat. So in particular, what I get is that

Phi_{mu hat, sigma hat^2}(t) = Phi_{0,1}((t - mu hat)/sigma hat),

where Phi_{0,1} is just notation for the standard Gaussian CDF — usually we don't write the subscripts, but here it's more convenient. That's something you can quickly check: there's this nice way of writing the cumulative distribution function for any mean and any variance in terms of the cumulative distribution function with mean 0 and variance 1. Not too complicated.
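If you want to convince yourself of that identity numerically, here is a two-line sketch (the values are arbitrary illustration numbers, and scipy is assumed):

```python
from scipy.stats import norm

# Numerical check of Phi_{mu,sigma^2}(t) = Phi_{0,1}((t - mu)/sigma).
mu_hat, sigma_hat, t = 1.5, 2.0, 0.7                # arbitrary illustration values
print(norm.cdf(t, loc=mu_hat, scale=sigma_hat))     # Phi_{mu_hat, sigma_hat^2}(t)
print(norm.cdf((t - mu_hat) / sigma_hat))           # Phi_{0,1} at the standardized point
# Both lines print the same number, up to floating point.
```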
All right. So now I have this sup here, and what I can write is that this thing equals the sup over t in R of |(1/n) sum_{i=1}^n 1{Xi <= t} - Phi_{0,1}((t - mu hat)/sigma hat)| — I've just written out what F_n is. Now I actually want to make a change of variable, so I'm going to call this quantity u. To make my life easier, I'm going to make it appear on the other side as well: I replace the indicator by the indicator that (Xi - mu hat)/sigma hat <= (t - mu hat)/sigma hat, which is sort of useless at this point — I'm just making my formula more complicated. But now I see the same thing showing up on both sides, and I will call it u in both places.

So what that means is this: when t ranges from negative infinity to plus infinity, u = (t - mu hat)/sigma hat also ranges from negative infinity to plus infinity. So this sup over t I can write as the sup over u of |(1/n) sum_{i=1}^n 1{(Xi - mu hat)/sigma hat <= u} - Phi_{0,1}(u)|.

Now, let's pause for one second and see where we're going. We're trying to show that this thing does not depend on the unknown parameters mu and sigma, which are the mean and the variance of X under the null. To do that, we basically need to involve only quantities that are invariant under those values. The second term is fine: it's the standard Gaussian CDF, which depends on nothing — it doesn't involve sigma hat and mu hat anymore. But sigma hat and mu hat themselves will depend on mu and sigma, right? I mean, they're good estimators of those guys, so they should be pretty close to them.
So I need to make sure that I'm not actually doing anything wrong here. The key observation concerns (1/n) sum_{i=1}^n 1{(Xi - mu hat)/sigma hat <= u}, the first term inside that absolute value. Under the null, Xi follows N(mu, sigma^2) for some mu and sigma^2 that are unknown — they exist, I just don't know what they are. Then (Xi - mu hat)/sigma hat can be written as (sigma Zi + mu - mu hat)/sigma hat, where Zi = (Xi - mu)/sigma. That's the same trick I wrote before. Everybody agree? Just a standardization.

Now, once I write this, I can divide everybody by sigma — the top and the bottom — to get (Zi + (mu - mu hat)/sigma) / (sigma hat/sigma). So what I need to check is that the distribution of this guy does not depend on mu or sigma. That's what I claim.

What is the distribution of this indicator? It's a Bernoulli, right? So if I want to understand its distribution, all I need to do is compute its expectation, which is just the probability that the event happens. And that probability is actually not going to depend on mu and sigma. Here's why, piece by piece. Mu hat is Xn bar, so under the null mu hat follows N(mu, sigma^2/n) — that's the property of the average. So when I take (mu hat - mu)/sigma, what distribution does it have? It's still a normal — it's a linear transformation of a normal. What are the parameters?

AUDIENCE: 0, 1/n.
PHILIPPE RIGOLLET: Yeah, N(0, 1/n). And this does not depend on mu or sigma, right? Now I need to check that this other guy, sigma hat over sigma, does not depend on mu or sigma either. What is its distribution?

AUDIENCE: It's a chi-square, right?

PHILIPPE RIGOLLET: Yeah, it is a chi-square. So this is actually — n times sigma hat squared divided by sigma squared is a chi-square with n - 1 degrees of freedom. It does not depend on mu or sigma.

AUDIENCE: [INAUDIBLE]

AUDIENCE: Or sigma hat squared over sigma squared?

PHILIPPE RIGOLLET: Yeah, thank you. So it's the squared quantities that should appear in the ratio — let's write it like that; that's the proper way of writing it.

So now I have these two things, and neither of them depends on mu or sigma. There's just one more thing to check. What is it?

AUDIENCE: That they're independent?

PHILIPPE RIGOLLET: That they're independent, right. Because the dependence on mu and sigma could be hidden in the covariance. It could be the case that the marginal distribution of mu hat does not depend on mu or sigma, and the marginal distribution of sigma hat does not depend on mu or sigma, but their correlation depends on mu and sigma. But we also have independence: since mu hat is independent of sigma hat, the joint distribution of (mu hat - mu)/sigma and sigma hat/sigma does not depend on mu or sigma. Agreed? It's not in the individual ones, and it's not in the way they interact with each other. It's nowhere.

AUDIENCE: [INAUDIBLE] independence be [INAUDIBLE] theorem?

PHILIPPE RIGOLLET: Yeah — Cochran's theorem, right.
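Here is a quick Monte Carlo sanity check of those two distributional facts (a sketch of my own, not from the lecture; the constants are arbitrary). Whatever mu and sigma you simulate with, the answers come out the same:

```python
import numpy as np

# Check: (mu_hat - mu)/sigma should be N(0, 1/n), and n*sigma_hat^2/sigma^2
# should be chi-square with n - 1 degrees of freedom, for ANY mu and sigma.
rng = np.random.default_rng(0)
n, mu, sigma, reps = 20, 3.0, 2.0, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

z = (x.mean(axis=1) - mu) / sigma        # should be N(0, 1/n)
print(z.mean(), z.var(), 1 / n)          # ~0 and ~0.05 vs 0.05

w = n * x.var(axis=1) / sigma**2         # x.var uses ddof=0, i.e. the 1/n estimator S_n
print(w.mean(), n - 1)                   # mean of chi^2_{n-1} is n - 1
# Rerunning with any other (mu, sigma) gives the same answers: the pair is pivotal.
```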
Cochran's theorem is something we've been using over and over again. And note that all of this is under the null: if my data is not Gaussian, none of it holds. I just used the fact that under the null the data is Gaussian for some mean mu and variance sigma squared. But that's all I care about — when I'm designing a test, I only care about the distribution under the null, at least to control the type I error. Then, to control the type II error, I cross my fingers pretty hard.

So this basically implies what's written on the board: this test statistic does not depend on any unknown parameters. It's pivotal. In particular, I could go to the back of a book and check whether there's a table for the quantiles of this thing — and indeed there is. This is the table that you see. Actually, this is not even in a book: it's in Lilliefors' original paper, from 1967, as you can tell from the typewriting. He probably was rolling some dice in his office back in the day — he simulated it, and this is how he computed those numbers. And here you also have a limiting distribution, which is not the sup of a Brownian bridge over [0, 1] — the one you would see for the Kolmogorov-Smirnov test — but something slightly different.

And as I said, these numbers are typically much smaller than the ones you would get for Kolmogorov-Smirnov. Remember, we got something around 0.5, I think — or maybe 0.41 — for the Kolmogorov-Smirnov test at the same table entry. That means that using the Kolmogorov-Lilliefors test, it's going to be harder for you not to reject for the same data. It might be the case that with one test you reject and with the other you fail to reject. But the ordering is always the same: if you fail to reject with Kolmogorov-Lilliefors, you will fail to reject with Kolmogorov-Smirnov.
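Lilliefors' simulation is easy to reproduce today. The sketch below (my own; the replication count is arbitrary) estimates the level-5% threshold for n = 50 by simulating the statistic under mu = 0, sigma = 1 — which pivotality makes legitimate, since the quantiles then apply for every mu and sigma:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def kl_statistic(x):
    """Kolmogorov-Smirnov distance with plugged-in mu_hat, sigma_hat."""
    n = len(x)
    cdf = norm.cdf(np.sort(x), loc=x.mean(), scale=x.std())
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - cdf), np.max(cdf - (grid - 1 / n)))

n, reps = 50, 20_000
stats = [kl_statistic(rng.normal(size=n)) for _ in range(reps)]
print(np.quantile(stats, 0.95))   # level-5% threshold for sample size 50
# This comes out clearly below the Kolmogorov-Smirnov threshold at the same entry.
```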
So that ordering is why people tend to close their eyes and prefer Kolmogorov-Smirnov: it just makes their life easier. This test is called Kolmogorov-Lilliefors — I think the spelling is slightly off on the slide, there's an I before the E — but that doesn't matter too much. Are there any questions? Yes?

AUDIENCE: Is there like a place you can point to like [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So here is why it's actually a different distribution. If I knew what mu and sigma were, I would do exactly the same thing, but rather than having this average with mu hat and sigma hat, I would have the average with mu and sigma. So the key point is this. For the K-S test, what I would compare is (1/n) times a sum of Bernoullis — the indicators 1{(Xi - mu)/sigma <= t} — whose parameter is the probability that (Xi - mu)/sigma is less than t, which is just Phi_{0,1}(t); minus Phi_{0,1}(t); and then I take the sup over t. That's what I would have had, because this is exactly the right standardization: I remove the true mean and divide by the true standard deviation, so (Xi - mu)/sigma actually is a standard Gaussian, and that's why I'm allowed to use Phi_{0,1} here. Agreed? And these are Bernoullis because they're just indicators.

What happens in the Kolmogorov-Lilliefors test? Well, I still have a Bernoulli; the only thing that changes is that the parameter of the Bernoulli gets weird.
The parameter of the Bernoulli becomes the probability that (an N(0, 1) plus an N(0, 1/n)), divided by the square root of (a chi-square with n - 1 degrees of freedom divided by n), is less than t. The pieces coming from mu hat and sigma hat are independent of each other, but the leading N(0, 1) — which is Zi itself — is not necessarily independent of them. And why is this probability changing? Well, because the denominator is fluctuating a lot, and that makes the probability different. So that's basically where the difference comes from.

You can probably convince yourself quickly that this only makes the two terms — the empirical average and Phi_{0,1} — farther apart. And why farther apart? For a pretty clear reason. In the K-S case, the expectation of the Bernoulli is exactly that second term. Here, I think it's still true that the expectation of the Bernoulli is that term, but the fluctuations are going to be much bigger than those of a plain Bernoulli — because the first thing I do is draw a random parameter for my Bernoulli, and then I flip the Bernoulli. So the fluctuations are bigger than a Bernoulli's, and when I take the sup, I'm going to have to [INAUDIBLE] them. So it makes things farther apart, which makes it more likely for you to reject. Yeah?

AUDIENCE: You also said that if you compare the tables at the same level, the Lilliefors is at like 0.2 and the Smirnov is at 0.4.

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: OK. So it means that with Lilliefors it's harder not to reject?

PHILIPPE RIGOLLET: It means that with Lilliefors it's harder not to reject, yes, because we reject when we're larger than the number. So with the number being smaller, for the same data we might be, right? Basically, it looks like this. Say this curve is the density of the test statistic for K-S, and then we have the density for Kolmogorov-Lilliefors, K-L.
And the density of K-L sits shifted like this. So if I want to squeeze probability alpha into the right tail of each, then this point is the quantile of order alpha of K-L, and that one is the quantile of order alpha of K-S — and the K-L quantile is the smaller of the two. Now you give me data, and what do I do with it? I check whether the statistic is larger than a number. If I apply K-S, I check whether I'm larger or smaller than the K-S quantile; if I apply Kolmogorov-Lilliefors, I check whether I'm larger or smaller than the K-L quantile. So over the entire range of values in between — and it is the same test statistic, I just plugged in mu hat and sigma hat — the two tests have different outcomes. And this is a big range in practice; I mean, it's pretty much at scale here. Any other questions? Yeah?

AUDIENCE: [INAUDIBLE] when n goes to infinity, the two tests become the same, right?

PHILIPPE RIGOLLET: Hmmm.

AUDIENCE: Looking at that formula--

PHILIPPE RIGOLLET: Yeah, they should become the same very far out. Let me see, though. So here we have, say, for 0.5 we get 0.886, and for — oh, I don't have it. Yeah, actually, sorry — you're right, you're totally right. These are the Brownian bridge values. Because in the limit, by Slutsky, the N(0, 1/n) term has no fluctuation and the chi-square term has no fluctuation; they're just pinned down, and it looks as if I had not replaced anything. In the limit, mu hat and sigma hat converge to mu and sigma much faster than the empirical CDF converges to the CDF, so those corrections become negligible. Actually, these are the numbers I showed you last time for the Brownian bridge, because I didn't have them for the Kolmogorov-Smirnov one.
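In software, that comparison is one line each way. The sketch below (my own, assuming scipy and statsmodels; the seed and sample are arbitrary) contrasts the naive plug-in K-S p-value with the Lilliefors-corrected one on the same Gaussian data:

```python
import numpy as np
from scipy.stats import kstest
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(2)
x = rng.normal(size=100)   # data that really is Gaussian

# "Eyes closed": plug in mu_hat, sigma_hat but keep the K-S null distribution.
# Because the K-S quantiles are too large once the parameters are estimated,
# this version almost never rejects — its p-values run too high.
print(kstest(x, 'norm', args=(x.mean(), x.std())))

# Lilliefors: same statistic, p-value from the corrected null distribution.
print(lilliefors(x, dist='norm'))
```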
So those are numerical ways of checking things, right? I give you data, you just crank the Kolmogorov-Smirnov test — usually you press F5 in MATLAB. But let's say you actually compute this entire thing: a number comes out, and you decide whether it's large enough or small enough. Of course, statistical software makes your life even simpler by spitting out a p-value — if you can compute quantiles, you can also compute p-values. So your life is fairly easy: red is bad, green is good, and off you go.

The problem is that those are numbers you want to rely on. But let's say you actually reject — say your p-value is just slightly below 5%. You could say, well, maybe I'm just going to change my threshold to 1%, but you might want to see what's actually happening. And for that you need a visual diagnostic. How do I check whether something departs from being normal, for example? How do I see why a distribution is not a uniform distribution, or not an exponential distribution? There are many, many ways to depart. If I have a supposedly exponential distribution and half of my values are negative, for example, there's a pretty obvious reason why it should not be exponential. But it could also be that just the tails are a little heavier, or there's more concentration at some point, or maybe it has two modes — things like this. And really, we don't believe the Gaussian is so important because of what it looks like close to 0.
What we like about the Gaussian is that the tails decay at this rate — exp(-x^2/2) — as we described in maybe the first lecture. In particular, if there were kinks near the center, it wouldn't matter too much; that's not what causes issues for the Gaussian. So what we want is a visual diagnostic that tells us whether the tails of my distribution are comparable to the tails of a Gaussian one, for example. And those are what's called quantile-quantile plots, or QQ plots. The basic QQ plots we're going to be using are the so-called normal QQ plots, which compare your data to a Gaussian, or normal, distribution — but in general you could compare your data to any distribution you want. And the way you do this is by comparing the quantiles of your data, the empirical quantiles, to the quantiles of the actual distribution you're trying to compare yourself to.

So this is, in a way, a visual way of performing these goodness-of-fit tests. And what's nice about a visual check is that there's room for debate: you can see something that somebody else cannot see — and you can always argue, because you want to say that things are Gaussian. We'll see some examples where you can actually claim it if you're good at debate, but where it's clearly not true.

All right. So this is a quick and easy check — something I do all the time. You give me data, this is one of the first things I run, so I can check whether I can enter the Gaussian world without compromising myself too much. And the idea is this: if my data comes from some F, and I know that F_n is close to F, then rather than computing some norm — a single number summarizing how far apart they are — I could plot the two functions and see whether they're far apart. So let's think for one second about what such a plot would look like.
Well, on the vertical axis everything happens between 0 and 1. Let's say my reference distribution is the Gaussian, so this curve is the CDF of N(0, 1). And on top of it I have the empirical CDF — remember, it's piecewise constant — so we get a piecewise constant curve for F_n. Just from this, even despite my bad drawing skills, it's clear that it's going to be hard for you to distinguish those two curves, even for a fairly large number of points. Because the problems are going to happen out in the tails, and there the two curves look pretty much the same. You might see differences in the middle, but we don't care too much about those differences.

So what's going to happen is that you want to compare those two things, and you basically have the information you want, but visually it just doesn't render very well, because you're not scaling things properly. The way we actually do it is by flipping things around: rather than comparing the plot of F_n to the plot of F, we compare the plot of F_n inverse to the plot of F inverse. Now, if F goes from the real line to the interval [0, 1], then F inverse goes from [0, 1] to the whole real line. So I'm going to be comparing things on "intervals" that are the entire real line.

Then at what values should I look at those inverses? Well, for F, if F is continuous, I can look at F inverse at any value I please: I have F, I pick a point u on the vertical axis, and F inverse of u is the value I read off on the horizontal axis. The problem is that with the piecewise constant empirical CDF, I need to decide what value to assign for anything that sits between two jumps.
I can choose whatever I want, but in practice it's just a choice I make myself. Maybe I decide it's this value, maybe that one — but for every level u strictly between two jumps, I'm going to get pretty much the same value out: the location of the jump. So rather than picking levels in between, I might as well pick only the levels at which a jump actually occurs, and those are exactly 1/n, 2/n, 3/n, all the way to n/n. That's exactly where the flat parts end; we jump by 1/n every time.

And that's exactly the recipe: look at the values 1/n, 2/n, 3/n, up to, say, (n-1)/n, and at those values compute the inverse of both the empirical CDF and the true CDF. Now, for the empirical CDF it's easy — I just told you, it's where the jumps occur. And the jumps occur exactly at my observations. Remember, I need to sort the observations to talk about them: the one where the i-th jump occurs is the i-th smallest observation, which we denoted by X sub (i). We had this notation: the data X1, ..., Xn gets sorted into X_(1) <= X_(2) <= ... <= X_(n), ordered from smallest to largest, and then we use the parenthesis subscripts. So in particular, F_n inverse of i/n is the location where the i-th jump occurs, which is the i-th smallest observation.
So for this guy, the y-values are fairly easy: they're basically my ordered observations. The x-values depend on the function F I'm trying to test. If it's the Gaussian, F inverse of i/n is just the quantile of order 1 - i/n — it's this q_{1 - i/n} that I need to compute, the inverse of the cumulative distribution function, which, given the formula for F, you can compute or estimate fairly well; it's something you can find in tables. Those are basically quantiles — inverses of CDFs are quantiles, right?

So that's what we're plotting. These are sometimes referred to as the theoretical quantiles — for the distribution we're trying to test — and the empirical quantiles — the ones that correspond to the empirical CDF. I'm making a plot where the x-axis is a quantile and the y-axis is a quantile, so I call it a quantile-quantile plot, or QQ plot — well, just say "quantile-quantile" ten times and you'll see why. Yeah?

AUDIENCE: [INAUDIBLE] have to have the [INAUDIBLE]?

PHILIPPE RIGOLLET: Well, we're back to the goodness-of-fit test, right? So you don't do it yourself — that's the simple answer. I'm just telling you what these plots, as spit out by software, are going to look like. Now, different software does different things. Say you want to test normality, as you asked: some software will plot F with the estimated parameters — the Gaussian CDF with mu hat and sigma hat — and that's fine. Some software will not do this; it will just use the standard Gaussian. But then it will have a different reference line.
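Before looking at what software does, it may help to see that the plot takes only a few lines to build by hand. Here is a sketch (my own, in Python; the (i - 0.5)/n plotting positions are one common convention for dodging the boundary problem at i = n that comes up later):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=200)
n = len(x)

emp_q = np.sort(x)                                   # F_n^{-1}(i/n): the order statistics
theo_q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # standard normal quantiles

plt.scatter(theo_q, emp_q, s=10)
plt.plot(theo_q, theo_q)          # the 45-degree reference line
plt.xlabel("theoretical quantiles")
plt.ylabel("empirical quantiles")
plt.show()
```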
So what do we want to see here? What should happen if all my points actually come from a distribution that has CDF F? Well, since F_n should be close to F, F_n inverse should be close to F inverse, which means each plotted point should be close to the corresponding point on the diagonal. So ideally, if I picked the right F, I should see a plot where all my points are very close to the line y = x. There will be some fluctuations, but something very close to that line.

Now, that's if F is exactly the right one. If F is not exactly the right one — in particular, in the Gaussian case, if I plot against the quantiles of Phi_{0,1} while the data really has mean mu hat and standard deviation sigma hat — then, since we know that Phi_{mu hat, sigma hat^2}(t) = Phi_{0,1}((t - mu hat)/sigma hat), there's just a change of axis, a simple translation and scaling. That means the 45-degree line gets transformed into another line with a different slope and a different intercept. And some software will go with that line: it will just show you what the reference line should be, rather than putting everything back onto the 45-degree line.

AUDIENCE: So you're happy if you get any straight line?

PHILIPPE RIGOLLET: Any straight line, you're happy — well, depending on the software. Because if the software actually rescaled things using mu hat and sigma hat and you still find a different straight line, that's bad news — which is actually not going to happen. Well, if the data is crazy, it could, but it shouldn't be very crazy.
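Here is a quick sketch of that slope-and-intercept fact (my own illustration, with arbitrary parameter values): fitting a line through a QQ plot of N(mu, sigma^2) data against standard normal quantiles recovers roughly sigma and mu.

```python
import numpy as np
from scipy.stats import norm

# If the data is N(mu, sigma^2) but we plot against *standard* normal
# quantiles, the points still line up -- on the line y = sigma*x + mu
# rather than y = x, by the standardization identity above.
rng = np.random.default_rng(4)
mu, sigma, n = 5.0, 2.0, 1000
emp_q = np.sort(rng.normal(mu, sigma, size=n))
theo_q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)

slope, intercept = np.polyfit(theo_q, emp_q, deg=1)
print(slope, intercept)   # close to sigma = 2 and mu = 5
```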
802 00:39:14,600 --> 00:39:20,380 So here in R, R actually does this funny trick where-- 803 00:39:20,380 --> 00:39:22,240 so here I did not actually plot the lines. 804 00:39:22,240 --> 00:39:23,573 I should actually add the lines. 805 00:39:23,573 --> 00:39:27,839 So the command is like qqnorm of my sample, right? 806 00:39:27,839 --> 00:39:28,880 And that's really simple. 807 00:39:28,880 --> 00:39:33,580 I just stack all my data into some vector, say, x. 808 00:39:33,580 --> 00:39:40,150 And I say qqnorm of x, and it just spits this thing out. 809 00:39:40,150 --> 00:39:40,720 OK? 810 00:39:40,720 --> 00:39:42,000 Very simple. 811 00:39:42,000 --> 00:39:44,262 But I could actually add another command, 812 00:39:44,262 --> 00:39:45,220 which I can't remember. 813 00:39:45,220 --> 00:39:50,670 I think it's like qqline, and it's just going 814 00:39:50,670 --> 00:39:52,980 to add the line on top of it. 815 00:39:52,980 --> 00:39:55,210 But if you see, actually what R does for us, 816 00:39:55,210 --> 00:39:58,830 it's actually doing the translation and scaling 817 00:39:58,830 --> 00:40:01,710 on the axes themselves. 818 00:40:01,710 --> 00:40:05,587 So it actually changes the x- and y-axes in such a 819 00:40:05,587 --> 00:40:07,170 way that when you look at your picture 820 00:40:07,170 --> 00:40:09,570 and you forget about what the meaning of the axes is, 821 00:40:09,570 --> 00:40:11,520 the relevant straight line is actually 822 00:40:11,520 --> 00:40:13,185 still the 45-degree line. 823 00:40:13,185 --> 00:40:17,605 That's because it's actually done the change of units for you. 824 00:40:17,605 --> 00:40:19,230 So you don't even have to see the line. 825 00:40:19,230 --> 00:40:21,630 You know, in your mind, that this is basically-- 826 00:40:21,630 --> 00:40:25,520 the reference line is still 45 degrees because that's 827 00:40:25,520 --> 00:40:27,050 the way the axes are made. 828 00:40:27,050 --> 00:40:29,940 But if I actually put my axes, right-- so here, for example, 829 00:40:29,940 --> 00:40:31,490 it goes from-- 830 00:40:31,490 --> 00:40:32,820 let's look at some-- 831 00:40:32,820 --> 00:40:36,310 well, OK, those are all square. 832 00:40:36,310 --> 00:40:38,810 Yeah, and that's probably because they actually have-- 833 00:40:38,810 --> 00:40:41,380 the samples are actually from a standard normal. 834 00:40:41,380 --> 00:40:43,245 So I did not make my life very easy 835 00:40:43,245 --> 00:40:45,120 to illustrate your question, but of course, I 836 00:40:45,120 --> 00:40:46,661 didn't know you were going to ask it. 837 00:40:46,661 --> 00:40:49,280 Next time, let's just prepare. 838 00:40:49,280 --> 00:40:50,760 Let's script more. 839 00:40:50,760 --> 00:40:52,620 We'll see another one in the next plot. 840 00:40:52,620 --> 00:40:54,540 But so here what you expect to see 841 00:40:54,540 --> 00:40:58,410 is that all the points should be on the 45-degree line, right? 842 00:40:58,410 --> 00:40:59,830 This should be the right one. 843 00:40:59,830 --> 00:41:02,850 And if you see, when I start having 10,000 samples, 844 00:41:02,850 --> 00:41:04,480 this is exactly what's happening. 845 00:41:04,480 --> 00:41:05,930 So this is as good as it gets. 846 00:41:05,930 --> 00:41:08,240 This is an N(0, 1) plotted against the theoretical 847 00:41:08,240 --> 00:41:10,300 quantiles of an N(0, 1). 848 00:41:10,300 --> 00:41:12,090 As good as it gets.
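The two commands just mentioned are indeed all it takes in R; here is a sketch of how panels like the ones on the slide could be generated (the sample sizes match the slide, the random samples of course don't):

    # R's built-in QQ plot: sample quantiles against standard normal quantiles.
    # qqline() adds a reference line through the first and third quartiles.
    par(mfrow = c(2, 2))
    for (n in c(10, 50, 100, 10000)) {
      x <- rnorm(n)     # swap in rt(n, df = 15) for the Student case coming next
      qqnorm(x, main = paste("n =", n))
      qqline(x)
    }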
849 00:41:12,090 --> 00:41:15,110 And if you see, for the second one, which is 50, 850 00:41:15,110 --> 00:41:16,610 sample size of size-- 851 00:41:16,610 --> 00:41:19,730 sample of size 50, there is some fudge factor, right? 852 00:41:19,730 --> 00:41:20,690 I mean, those things-- 853 00:41:20,690 --> 00:41:22,310 doesn't look like there's a straight line, right? 854 00:41:22,310 --> 00:41:24,851 It sort of appears that there are some weird things happening 855 00:41:24,851 --> 00:41:27,810 here at the lower tail. 856 00:41:27,810 --> 00:41:29,310 And the reason why this is happening 857 00:41:29,310 --> 00:41:32,400 is because we're trying to compare the tails, right? 858 00:41:32,400 --> 00:41:34,980 When I look at this picture, the only thing that goes wrong 859 00:41:34,980 --> 00:41:37,050 somehow is always at the tip, because those 860 00:41:37,050 --> 00:41:39,090 are sort of rare and extreme values, 861 00:41:39,090 --> 00:41:41,100 and they're sort of all over the place. 862 00:41:41,100 --> 00:41:44,610 And so things are never really super smooth and super clean. 863 00:41:44,610 --> 00:41:46,920 So this is what your best shot is. 864 00:41:46,920 --> 00:41:49,140 This is what you will ever hope to get. 865 00:41:49,140 --> 00:41:52,486 So size 10, right, so you have 10 points. 866 00:41:52,486 --> 00:41:54,360 Remember, we actually-- well, I didn't really 867 00:41:54,360 --> 00:41:56,220 tell you how to deal with the extreme cases. 868 00:41:56,220 --> 00:41:59,720 Because the problem is that F inverse of 1 for the true F 869 00:41:59,720 --> 00:42:01,050 is plus infinity. 870 00:42:01,050 --> 00:42:04,350 So you have to make some sort of weird boundary choices 871 00:42:04,350 --> 00:42:07,830 to decide what F inverse of 1 is, and it's something 872 00:42:07,830 --> 00:42:09,694 that's like somewhere. 873 00:42:09,694 --> 00:42:11,610 But you still want to put like 10 dots, right? 874 00:42:11,610 --> 00:42:15,450 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 dots. 875 00:42:15,450 --> 00:42:17,610 So I have 10 observations, you will see 10 dots. 876 00:42:17,610 --> 00:42:21,230 I have 50 observations, you will see 50 dots, right, 877 00:42:21,230 --> 00:42:22,230 because I have-- 878 00:42:22,230 --> 00:42:26,720 there are 1/n, 2/n, 3/n all the way to n/n. 879 00:42:26,720 --> 00:42:29,010 I didn't tell you the last one. 880 00:42:29,010 --> 00:42:29,510 OK. 881 00:42:29,510 --> 00:42:31,490 So this is when things go well, and this is 882 00:42:31,490 --> 00:42:32,881 when things should not go well. 883 00:42:32,881 --> 00:42:33,380 OK? 884 00:42:33,380 --> 00:42:35,030 So here, actually, the distribution 885 00:42:35,030 --> 00:42:37,670 is a Student's t with 15 degrees of freedom, 886 00:42:37,670 --> 00:42:41,180 which should depart somewhat from a Gaussian distribution. 887 00:42:41,180 --> 00:42:44,070 The tails should be heavier. 888 00:42:44,070 --> 00:42:47,700 And what you can see is basically the following, 889 00:42:47,700 --> 00:42:51,240 is that for 10 you actually see something that's crazy, right, 890 00:42:51,240 --> 00:42:52,980 if I do 10 observations. 891 00:42:52,980 --> 00:42:55,075 But if I do 50 observations, honestly, it's 892 00:42:55,075 --> 00:42:56,700 kind of hard to say that it's different 893 00:42:56,700 --> 00:42:58,560 from the standard normal. 894 00:42:58,560 --> 00:43:01,410 So you could still be happy with this for 100. 895 00:43:01,410 --> 00:43:03,840 And then this is what's happening for 10,000. 
896 00:43:03,840 --> 00:43:06,957 And even here it's not the beautiful straight line, 897 00:43:06,957 --> 00:43:08,790 but it feels like you would be still tempted 898 00:43:08,790 --> 00:43:11,580 to conclude that it's a beautiful straight line. 899 00:43:11,580 --> 00:43:13,800 So let's try to guess. 900 00:43:13,800 --> 00:43:18,420 So basically, there's-- for each of those sides there's two 901 00:43:18,420 --> 00:43:18,960 phenomena. 902 00:43:18,960 --> 00:43:22,080 Either it goes like this or it goes like this, 903 00:43:22,080 --> 00:43:24,960 and then it goes like this or it goes like this. 904 00:43:24,960 --> 00:43:28,200 Each side corresponds to the left tail, all the smallest 905 00:43:28,200 --> 00:43:29,360 values. 906 00:43:29,360 --> 00:43:30,360 So that's the left side. 907 00:43:30,360 --> 00:43:31,984 And that's the right side-- corresponds 908 00:43:31,984 --> 00:43:33,040 to the large values. 909 00:43:33,040 --> 00:43:33,540 OK? 910 00:43:33,540 --> 00:43:35,460 And so basically you can actually 911 00:43:35,460 --> 00:43:40,050 think of some sort of a table that tells you 912 00:43:40,050 --> 00:43:41,310 what your QQ plot looks like. 913 00:43:47,220 --> 00:43:48,650 And so let's say it looks-- 914 00:43:48,650 --> 00:43:50,840 so we have our reference 45-degree line. 915 00:43:50,840 --> 00:43:52,960 So let's say this is the QQ plot. 916 00:43:52,960 --> 00:43:54,820 That could be one thing. 917 00:43:54,820 --> 00:43:59,380 This could be the QQ plot where I have another thing. 918 00:43:59,380 --> 00:44:08,890 Then I can do this guy, and then I do this guy. 919 00:44:08,890 --> 00:44:10,690 So this is like this. 920 00:44:10,690 --> 00:44:11,642 OK? 921 00:44:11,642 --> 00:44:13,665 So those are the four cases. 922 00:44:13,665 --> 00:44:14,950 OK? 923 00:44:14,950 --> 00:44:19,000 And here what's changing is the right tail, 924 00:44:19,000 --> 00:44:20,970 and here what's changing is the-- 925 00:44:20,970 --> 00:44:24,820 and when I go from here to here, what changes is the left tail. 926 00:44:24,820 --> 00:44:26,851 Is that true? 927 00:44:26,851 --> 00:44:27,350 No, sorry. 928 00:44:27,350 --> 00:44:29,290 What changes here is the right tail, right? 929 00:44:29,290 --> 00:44:34,110 It's this part that changes from top to bottom. 930 00:44:34,110 --> 00:44:38,542 So here it's something about right tail, 931 00:44:38,542 --> 00:44:40,410 and here that's something about left tail. 932 00:44:44,060 --> 00:44:46,805 Everybody understands what I mean when I talk about tails? 933 00:44:46,805 --> 00:44:48,200 OK. 934 00:44:48,200 --> 00:44:50,120 And so here it's just going to be 935 00:44:50,120 --> 00:44:52,670 a question of whether the tails are heavier 936 00:44:52,670 --> 00:44:54,689 or lighter than the Gaussian. 937 00:44:54,689 --> 00:44:56,480 Everybody understand what I mean when I say 938 00:44:56,480 --> 00:44:58,640 heavy tails and light tails? 939 00:44:58,640 --> 00:44:59,600 OK. 940 00:44:59,600 --> 00:45:01,670 So right, so heavy tails just means 941 00:45:01,670 --> 00:45:04,880 that basically here the tails of this guy 942 00:45:04,880 --> 00:45:06,540 are heavier than the tails of this guy. 943 00:45:06,540 --> 00:45:08,785 So it means that if I draw them, they're going to be above. 944 00:45:08,785 --> 00:45:10,520 Actually, I'm going to keep this picture because it's 945 00:45:10,520 --> 00:45:11,811 going to be very useful for me. 
946 00:45:16,170 --> 00:45:19,650 When I plot the quantiles at the same-- so let's 947 00:45:19,650 --> 00:45:21,180 look at the right tail, for example. 948 00:45:21,180 --> 00:45:23,610 Right here my picture is for right tails. 949 00:45:23,610 --> 00:45:26,350 When I look at the quantiles of my theoretical distribution-- 950 00:45:26,350 --> 00:45:28,440 so here you can see, on the bottom curve 951 00:45:28,440 --> 00:45:31,420 we have the theoretical quantiles, 952 00:45:31,420 --> 00:45:34,800 and those are the empirical quantiles. 953 00:45:34,800 --> 00:45:39,090 If I look to the right here, are the theoretical quantiles 954 00:45:39,090 --> 00:45:41,770 larger or smaller than the empirical quantiles? 955 00:45:47,124 --> 00:45:48,290 Let me phrase it the other-- 956 00:45:48,290 --> 00:45:50,460 are the empirical quantiles larger or smaller 957 00:45:50,460 --> 00:45:53,250 than the theoretical quantiles? 958 00:45:53,250 --> 00:45:56,610 AUDIENCE: This is a graph of quantiles, right? 959 00:45:56,610 --> 00:45:59,072 So if it's [INAUDIBLE] it should be smaller. 960 00:45:59,072 --> 00:46:01,030 PHILIPPE RIGOLLET: It should be smaller, right? 961 00:46:01,030 --> 00:46:04,190 On this line, they are equal. 962 00:46:04,190 --> 00:46:07,180 So if I see the empirical quantile showing up here, 963 00:46:07,180 --> 00:46:10,510 it means that here the empirical quantile is less 964 00:46:10,510 --> 00:46:12,550 than the theoretical quantile. 965 00:46:12,550 --> 00:46:13,890 Agree? 966 00:46:13,890 --> 00:46:16,410 So that means that if I look at this thing-- 967 00:46:16,410 --> 00:46:18,540 and that's for the same values, right? 968 00:46:18,540 --> 00:46:22,440 So the quantiles are computed for the same values i/n. 969 00:46:22,440 --> 00:46:25,890 So it means that the empirical quantiles should be looking-- 970 00:46:25,890 --> 00:46:29,840 so that should be the empirical quantile, 971 00:46:29,840 --> 00:46:32,470 and that should be the theoretical quantile. 972 00:46:32,470 --> 00:46:34,390 Agreed? 973 00:46:34,390 --> 00:46:37,730 Those are the smaller values for the same alpha. 974 00:46:37,730 --> 00:46:41,300 So that implies that the tails-- 975 00:46:41,300 --> 00:46:43,880 the right tail, is it heavy or lighter-- 976 00:46:43,880 --> 00:46:45,530 heavier or lighter than the Gaussian? 977 00:46:50,390 --> 00:46:51,140 AUDIENCE: Lighter. 978 00:46:51,140 --> 00:46:52,200 PHILIPPE RIGOLLET: Lighter, right? 979 00:46:52,200 --> 00:46:54,033 Because those are the tails of the Gaussian. 980 00:46:54,033 --> 00:46:55,650 Those are my theoretical quantiles. 981 00:46:55,650 --> 00:46:59,580 That means that this is the tail of my empirical distribution. 982 00:46:59,580 --> 00:47:00,870 So they are actually lighter. 983 00:47:08,090 --> 00:47:09,250 OK? 984 00:47:09,250 --> 00:47:11,500 So here, if I look at this thing, 985 00:47:11,500 --> 00:47:18,240 this means that the right tail is actually light. 986 00:47:18,240 --> 00:47:20,800 And by light, I mean lighter than Gaussian. 987 00:47:20,800 --> 00:47:22,650 Heavy, I mean heavier than Gaussian. 988 00:47:22,650 --> 00:47:23,730 OK? 989 00:47:23,730 --> 00:47:27,150 OK, now we can probably do the entire thing. 990 00:47:27,150 --> 00:47:31,980 Well, if this is light, this is going to be heavy, right? 991 00:47:31,980 --> 00:47:33,520 That's when I'm above the curve. 992 00:47:36,820 --> 00:47:40,390 Exercise-- is this light or is this heavy, the first column? 993 00:47:46,970 --> 00:47:47,900 And it's OK.
994 00:47:47,900 --> 00:47:51,734 It should take you at least 30 seconds. 995 00:47:51,734 --> 00:47:53,570 AUDIENCE: [INAUDIBLE] different column? 996 00:47:53,570 --> 00:47:54,740 PHILIPPE RIGOLLET: Yeah, this column, right? 997 00:47:54,740 --> 00:47:56,240 So this is something that pertains-- 998 00:47:56,240 --> 00:47:59,080 this entire column is going to tell me whether the fact 999 00:47:59,080 --> 00:48:01,620 that this guy is above, does this 1000 00:48:01,620 --> 00:48:06,570 mean that I have lighter or heavier left tails? 1001 00:48:06,570 --> 00:48:09,050 AUDIENCE: Well, on the left, it's heavier. 1002 00:48:09,050 --> 00:48:11,150 PHILIPPE RIGOLLET: On the left, it's heavier. 1003 00:48:11,150 --> 00:48:12,090 OK. 1004 00:48:12,090 --> 00:48:12,672 I don't know. 1005 00:48:12,672 --> 00:48:14,130 Actually, I need to draw a picture. 1006 00:48:14,130 --> 00:48:17,348 You guys are probably faster than I am. 1007 00:48:17,348 --> 00:48:19,872 AUDIENCE: [INTERPOSING VOICES]. 1008 00:48:19,872 --> 00:48:21,330 PHILIPPE RIGOLLET: Actually, let me 1009 00:48:21,330 --> 00:48:23,400 check how much randomness is-- 1010 00:48:23,400 --> 00:48:26,430 who says it's lighter? 1011 00:48:26,430 --> 00:48:27,450 Who says it's heavier? 1012 00:48:27,450 --> 00:48:29,880 AUDIENCE: Yeah, but we're biased. 1013 00:48:29,880 --> 00:48:30,852 AUDIENCE: [INAUDIBLE] 1014 00:48:30,852 --> 00:48:32,018 PHILIPPE RIGOLLET: Yeah, OK. 1015 00:48:32,018 --> 00:48:33,610 AUDIENCE: [INAUDIBLE] 1016 00:48:33,610 --> 00:48:34,818 PHILIPPE RIGOLLET: All right. 1017 00:48:34,818 --> 00:48:36,760 So let's see if it's heavier. 1018 00:48:36,760 --> 00:48:40,786 So we're on the left tail, and so we have one looks like this, 1019 00:48:40,786 --> 00:48:41,910 one looks like that, right? 1020 00:48:45,410 --> 00:48:49,100 So we know here that I'm looking at this part here. 1021 00:48:49,100 --> 00:48:52,070 So it means that here my empirical quantile is larger 1022 00:48:52,070 --> 00:48:53,320 than the theoretical quantile. 1023 00:48:58,480 --> 00:49:00,350 OK? 1024 00:49:00,350 --> 00:49:02,030 So are my tails heavier or lighter? 1025 00:49:06,125 --> 00:49:07,280 They're lighter. 1026 00:49:07,280 --> 00:49:08,180 That was a bad bias. 1027 00:49:08,180 --> 00:49:10,299 AUDIENCE: [INAUDIBLE] 1028 00:49:10,299 --> 00:49:11,340 PHILIPPE RIGOLLET: Right? 1029 00:49:11,340 --> 00:49:14,660 It's below, so it's lighter. 1030 00:49:14,660 --> 00:49:19,100 Because the problem is that larger for the negative ones 1031 00:49:19,100 --> 00:49:22,068 means that it's smaller [INAUDIBLE],, right? 1032 00:49:22,068 --> 00:49:23,550 Yeah? 1033 00:49:23,550 --> 00:49:26,514 AUDIENCE: Sorry but, what exactly are these [INAUDIBLE]?? 1034 00:49:26,514 --> 00:49:28,984 If this is the inverse-- 1035 00:49:28,984 --> 00:49:32,936 if this is the inverse CDF, shouldn't everything-- 1036 00:49:32,936 --> 00:49:34,912 well, if this is the inverse CDF, 1037 00:49:34,912 --> 00:49:36,394 then you should only be inputting 1038 00:49:36,394 --> 00:49:38,864 values between 0 and 1 in it. 1039 00:49:38,864 --> 00:49:40,840 And-- 1040 00:49:40,840 --> 00:49:42,900 PHILIPPE RIGOLLET: Oh, did I put the inverse CDF? 1041 00:49:42,900 --> 00:49:46,814 AUDIENCE: Like on the previous slide, I think. 1042 00:49:46,814 --> 00:49:48,230 PHILIPPE RIGOLLET: No, the inverse 1043 00:49:48,230 --> 00:49:49,910 CDF, yeah, so I'm inputting-- 1044 00:49:49,910 --> 00:49:51,339 AUDIENCE: Oh, you're [INAUDIBLE].. 
1045 00:49:51,339 --> 00:49:53,630 PHILIPPE RIGOLLET: Yeah, so it's a scatter plot, right? 1046 00:49:53,630 --> 00:49:56,780 So each point is attached-- each point 1047 00:49:56,780 --> 00:49:59,990 is attached to 1/n, 2/n, 3/n. 1048 00:49:59,990 --> 00:50:01,600 Now, for each point I'm plotting, 1049 00:50:01,600 --> 00:50:05,060 that's my x-value, which maps a number between 0 and 1 1050 00:50:05,060 --> 00:50:09,690 back onto the entire real line, and my y-value is the same. 1051 00:50:09,690 --> 00:50:10,190 OK? 1052 00:50:10,190 --> 00:50:14,370 So what it means is that those two numbers, this is in the-- 1053 00:50:14,370 --> 00:50:17,330 this lives on the entire real line, not on the interval. 1054 00:50:17,330 --> 00:50:20,540 This lives on the entire real line, not in the interval. 1055 00:50:20,540 --> 00:50:26,630 And so my QQ plots take values on the entire real line, 1056 00:50:26,630 --> 00:50:28,660 entire real line, right? 1057 00:50:28,660 --> 00:50:31,915 So you think of it as a parameterized curve, where 1058 00:50:31,915 --> 00:50:34,610 the time steps are 1/n, 2/n, 3/n, 1059 00:50:34,610 --> 00:50:38,740 and I'm just like putting a dot every time I'm making one step. 1060 00:50:38,740 --> 00:50:41,470 OK? 1061 00:50:41,470 --> 00:50:43,540 OK, so what did we say? 1062 00:50:43,540 --> 00:50:46,356 That was lighter, right? 1063 00:50:46,356 --> 00:50:51,196 AUDIENCE: [INAUDIBLE] 1064 00:50:51,196 --> 00:50:54,110 PHILIPPE RIGOLLET: OK? 1065 00:50:54,110 --> 00:50:58,380 One of my favorite exercises is, here's a bunch of densities. 1066 00:50:58,380 --> 00:51:00,140 Here's a bunch of QQ plots. 1067 00:51:00,140 --> 00:51:04,490 Map the correct QQ plot to its own density. 1068 00:51:04,490 --> 00:51:05,980 All right? 1069 00:51:05,980 --> 00:51:09,220 And there won't be mingled lines that allow you to do that; 1070 00:51:09,220 --> 00:51:11,720 you just have to follow them, like the mazes at the back of cereal 1071 00:51:11,720 --> 00:51:13,070 boxes. 1072 00:51:13,070 --> 00:51:15,530 All right. 1073 00:51:15,530 --> 00:51:17,165 Are there any questions? 1074 00:51:17,165 --> 00:51:18,540 So there's two things 1075 00:51:18,540 --> 00:51:19,914 I'm trying to communicate here: 1076 00:51:19,914 --> 00:51:22,460 if you see a QQ plot, now you should understand, 1077 00:51:22,460 --> 00:51:28,350 one, how it was built, and two, whether it means that you have 1078 00:51:28,350 --> 00:51:30,520 heavier tails or lighter tails. 1079 00:51:30,520 --> 00:51:32,760 Now, let's look at this guy. 1080 00:51:32,760 --> 00:51:34,800 What should we see? 1081 00:51:34,800 --> 00:51:37,480 We should see heavy on the left and heavy on the right, right? 1082 00:51:37,480 --> 00:51:39,360 We know that this should be the case. 1083 00:51:39,360 --> 00:51:45,130 So this thing actually looks like this, and it sort of does, 1084 00:51:45,130 --> 00:51:46,250 right? 1085 00:51:46,250 --> 00:51:48,860 If I take this line going through here, 1086 00:51:48,860 --> 00:51:50,620 I can see that this guy's tipping here, 1087 00:51:50,620 --> 00:51:52,360 and this guy's dipping here. 1088 00:51:52,360 --> 00:51:57,670 But honestly-- actually, I can't remember exactly, but t 15, 1089 00:51:57,670 --> 00:52:01,570 if I plotted the density on top of the Gaussian, 1090 00:52:01,570 --> 00:52:02,776 you can see a difference.
1091 00:52:02,776 --> 00:52:04,900 But if I just gave it to you, it would be very hard 1092 00:52:04,900 --> 00:52:07,399 for you to tell me if there's an actual difference between t 1093 00:52:07,399 --> 00:52:08,950 15 and Gaussian, right? 1094 00:52:08,950 --> 00:52:11,076 Those things are actually very close. 1095 00:52:11,076 --> 00:52:12,700 And so in particular, here we're really 1096 00:52:12,700 --> 00:52:15,640 trying to recognize what the shape is, 1097 00:52:15,640 --> 00:52:16,140 right? 1098 00:52:16,140 --> 00:52:20,980 So t 15 compared to a standard Gaussian was different, 1099 00:52:20,980 --> 00:52:26,119 but t 15 compared to a Gaussian with a slightly larger variance 1100 00:52:26,119 --> 00:52:27,910 is not going to actually-- you're not going 1101 00:52:27,910 --> 00:52:29,090 to see much of a difference. 1102 00:52:29,090 --> 00:52:33,610 So in a way, such distributions are actually not 1103 00:52:33,610 --> 00:52:35,890 too far from the Gaussian, and it's not too-- 1104 00:52:35,890 --> 00:52:38,950 it's still pretty benign to conclude that this was actually 1105 00:52:38,950 --> 00:52:42,283 a Gaussian distribution because you can just use the variance 1106 00:52:42,283 --> 00:52:43,750 as a little bit of a buffer. 1107 00:52:43,750 --> 00:52:45,250 I'm not going to get really into how 1108 00:52:45,250 --> 00:52:50,500 you would use a t-distribution in a t-test, 1109 00:52:50,500 --> 00:52:54,420 because it's kind of like Inception, right? 1110 00:52:54,420 --> 00:52:58,150 So but you could pretend that your data actually 1111 00:52:58,150 --> 00:53:02,010 is t-distributed and then build a t-test from it, 1112 00:53:02,010 --> 00:53:03,570 but let's not say that. 1113 00:53:03,570 --> 00:53:05,490 Maybe that was a bad example. 1114 00:53:05,490 --> 00:53:08,280 But there's like other heavy-tailed distributions like the 1115 00:53:08,280 --> 00:53:10,825 Cauchy distribution, which doesn't even have a mean-- 1116 00:53:10,825 --> 00:53:12,450 it's not even integrable because that's 1117 00:53:12,450 --> 00:53:14,490 as heavy as the tails get. 1118 00:53:14,490 --> 00:53:18,760 And this you can really tell it's going to look like this. 1119 00:53:18,760 --> 00:53:22,010 It's going to be like pfft. 1120 00:53:22,010 --> 00:53:24,240 What does a uniform distribution look like? 1121 00:53:30,727 --> 00:53:32,210 Like this? 1122 00:53:32,210 --> 00:53:37,890 It's going to be-- it's going to look like a Gaussian one, 1123 00:53:37,890 --> 00:53:38,940 right? 1124 00:53:38,940 --> 00:53:41,030 So a uniform-- so this is my Gaussian. 1125 00:53:41,030 --> 00:53:43,130 A uniform is basically going to look like this, 1126 00:53:43,130 --> 00:53:46,260 once I take the right mean and the right variance, right? 1127 00:53:46,260 --> 00:53:48,480 So the tails are definitely lighter. 1128 00:53:48,480 --> 00:53:49,640 They're 0. 1129 00:53:49,640 --> 00:53:51,570 That's as light as it gets. 1130 00:53:51,570 --> 00:53:55,290 So the light-light is going to look like this S shape. 1131 00:53:55,290 --> 00:53:59,050 So an S-- a light-tailed distribution has this S shape. 1132 00:53:59,050 --> 00:53:59,820 OK? 1133 00:53:59,820 --> 00:54:02,520 What is the exponential going to look like? 1134 00:54:06,620 --> 00:54:08,500 So the exponential is positively supported. 1135 00:54:08,500 --> 00:54:10,430 It only has positive numbers. 1136 00:54:10,430 --> 00:54:11,750 So there's no left tail. 1137 00:54:11,750 --> 00:54:14,110 This is also as light as it gets.
1138 00:54:14,110 --> 00:54:16,480 But the right tail, is it heavier or lighter 1139 00:54:16,480 --> 00:54:17,230 than the Gaussian? 1140 00:54:17,230 --> 00:54:18,420 AUDIENCE: Heavier. 1141 00:54:18,420 --> 00:54:19,080 PHILIPPE RIGOLLET: It's heavier, right? 1142 00:54:19,080 --> 00:54:21,990 Its tail decays like e to the minus x rather than e to the minus 1143 00:54:21,990 --> 00:54:22,860 x squared. 1144 00:54:22,860 --> 00:54:23,760 So it's heavier. 1145 00:54:23,760 --> 00:54:27,620 So it means that on the left it's going to be light, 1146 00:54:27,620 --> 00:54:29,430 and on the right it's going to be heavy. 1147 00:54:29,430 --> 00:54:31,870 So it's going to be U-shaped. 1148 00:54:31,870 --> 00:54:32,370 OK? 1149 00:54:35,340 --> 00:54:37,100 That will be fine. 1150 00:54:37,100 --> 00:54:39,800 All right. 1151 00:54:39,800 --> 00:54:41,840 Any other question? 1152 00:54:41,840 --> 00:54:44,990 Again, two messages-- one more technical, 1153 00:54:44,990 --> 00:54:47,960 and one you can sort of fiddle with by looking at it. 1154 00:54:47,960 --> 00:54:49,670 You can definitely conclude that this 1155 00:54:49,670 --> 00:54:53,456 is OK enough to be Gaussian for your purposes. 1156 00:54:53,456 --> 00:54:53,956 Yeah? 1157 00:54:53,956 --> 00:54:59,591 AUDIENCE: So [INAUDIBLE] 1158 00:54:59,591 --> 00:55:01,340 PHILIPPE RIGOLLET: I did not hear the "if" 1159 00:55:01,340 --> 00:55:02,756 at the beginning of your sentence. 1160 00:55:06,431 --> 00:55:08,472 AUDIENCE: I would want to be lighter tail, right, 1161 00:55:08,472 --> 00:55:10,436 because that'll be-- it's easier to reject? 1162 00:55:10,436 --> 00:55:11,909 Is that correct? 1163 00:55:16,340 --> 00:55:20,272 PHILIPPE RIGOLLET: So what is your purpose as a-- 1164 00:55:20,272 --> 00:55:21,733 AUDIENCE: I want to-- 1165 00:55:21,733 --> 00:55:25,142 I have some [INAUDIBLE] right? 1166 00:55:25,142 --> 00:55:28,551 I want to be able to say I reject H0 [INAUDIBLE].. 1167 00:55:28,551 --> 00:55:29,525 PHILIPPE RIGOLLET: Yes. 1168 00:55:29,525 --> 00:55:32,203 AUDIENCE: So if you wanted to make it easier 1169 00:55:32,203 --> 00:55:35,002 to reject H0, then-- 1170 00:55:35,002 --> 00:55:37,210 PHILIPPE RIGOLLET: Yeah, in a way that's true, right? 1171 00:55:37,210 --> 00:55:40,440 So once you've actually factored in the mean and the variance, 1172 00:55:40,440 --> 00:55:43,190 the only thing that actually-- 1173 00:55:43,190 --> 00:55:43,690 right. 1174 00:55:43,690 --> 00:55:47,950 So if you have Gaussian tails or lighter-- even lighter tails, 1175 00:55:47,950 --> 00:55:51,460 then it's harder for you to explain deviations 1176 00:55:51,460 --> 00:55:52,780 from randomness only, right? 1177 00:55:52,780 --> 00:55:54,640 If you have a uniform distribution 1178 00:55:54,640 --> 00:55:56,250 and you see something which is-- 1179 00:55:56,250 --> 00:55:59,680 if you're uniform on 0, 1 plus some number and you see 25, 1180 00:55:59,680 --> 00:56:01,960 you know this number is not going to be 0, right? 1181 00:56:01,960 --> 00:56:04,120 So that's basically as good as it gets. 1182 00:56:04,120 --> 00:56:06,610 And there's basically some smooth interpolation 1183 00:56:06,610 --> 00:56:07,940 if you have lighter tails. 1184 00:56:07,940 --> 00:56:10,600 Now, if you start having something that has heavy tails, 1185 00:56:10,600 --> 00:56:12,880 then it's more likely that pure noise 1186 00:56:12,880 --> 00:56:15,880 will generate large observations and therefore discoveries.
1187 00:56:15,880 --> 00:56:19,160 So yes, lighter tails is definitely 1188 00:56:19,160 --> 00:56:21,440 the better-behaved noise. 1189 00:56:21,440 --> 00:56:22,520 Let's put it this way. 1190 00:56:22,520 --> 00:56:24,740 The lighter it is, the better behaved it is. 1191 00:56:24,740 --> 00:56:27,230 Now, this is good-- 1192 00:56:27,230 --> 00:56:30,140 this is good for some purposes, but when you want to compute 1193 00:56:30,140 --> 00:56:35,420 actual quantiles, like exact quantiles, 1194 00:56:35,420 --> 00:56:40,070 then it is true in general that the quantiles of lighter-tailed 1195 00:56:40,070 --> 00:56:42,520 distributions-- let's say 1196 00:56:42,520 --> 00:56:46,236 on the right tail-- 1197 00:56:46,236 --> 00:56:47,610 are going to be dominated 1198 00:56:47,610 --> 00:56:51,410 by those of a heavier distribution. 1199 00:56:51,410 --> 00:56:52,729 That is true far out in the tail. 1200 00:56:52,729 --> 00:56:54,020 But that's not always the case closer in. 1201 00:56:54,020 --> 00:56:54,980 And in particular, there's going to be 1202 00:56:54,980 --> 00:56:57,627 some sort of weird points where things are actually 1203 00:56:57,627 --> 00:56:59,960 changing depending on what level you're actually looking 1204 00:56:59,960 --> 00:57:01,964 at those things, maybe 5% or 10%, 1205 00:57:01,964 --> 00:57:04,130 in which case things might be changing a little bit. 1206 00:57:04,130 --> 00:57:06,171 But if you start going really towards the tail, 1207 00:57:06,171 --> 00:57:10,220 if you start looking at levels alpha which are 1% or 0.1%, 1208 00:57:10,220 --> 00:57:13,070 it is true that the domination always holds. 1209 00:57:13,070 --> 00:57:14,990 So if you see something 1210 00:57:14,990 --> 00:57:16,790 that looks light-tailed, you definitely 1211 00:57:16,790 --> 00:57:18,581 do not want to conclude that it's Gaussian. 1212 00:57:18,581 --> 00:57:21,080 You want to actually change your modeling so that it 1213 00:57:21,080 --> 00:57:23,240 makes your life even easier. 1214 00:57:23,240 --> 00:57:25,400 And you actually factor in the fact 1215 00:57:25,400 --> 00:57:27,830 that you can see that the noise is actually more benign 1216 00:57:27,830 --> 00:57:30,929 than you would like it to be. 1217 00:57:30,929 --> 00:57:31,429 OK? 1218 00:57:34,190 --> 00:57:35,440 Stretching fingers, that's it? 1219 00:57:35,440 --> 00:57:37,930 All right. 1220 00:57:37,930 --> 00:57:38,880 OK. 1221 00:57:38,880 --> 00:57:40,045 So I want to-- 1222 00:57:40,045 --> 00:57:43,380 I mentioned at some point that we had this chi-square test 1223 00:57:43,380 --> 00:57:45,270 that was showing up. 1224 00:57:45,270 --> 00:57:47,720 And I do not know what I did-- 1225 00:57:47,720 --> 00:57:49,260 let's just-- oh, yeah. 1226 00:57:49,260 --> 00:57:53,770 So we have this chi-square test that we worked on last time, 1227 00:57:53,770 --> 00:57:54,270 right? 1228 00:57:54,270 --> 00:57:57,420 So the way I introduced the chi-square test is by saying, 1229 00:57:57,420 --> 00:57:59,520 I am fascinated by this question. 1230 00:57:59,520 --> 00:58:01,380 Let's check if it's correct, OK? 1231 00:58:01,380 --> 00:58:04,230 Or something maybe slightly deeper-- 1232 00:58:04,230 --> 00:58:06,570 let's check if juries in this country 1233 00:58:06,570 --> 00:58:10,740 are representative of the racial distribution. 1234 00:58:10,740 --> 00:58:14,640 But you could actually-- those numbers here 1235 00:58:14,640 --> 00:58:16,046 come from a very specific thing.
1236 00:58:16,046 --> 00:58:16,920 That was the uniform. 1237 00:58:16,920 --> 00:58:17,878 That was our benchmark. 1238 00:58:17,878 --> 00:58:19,320 Here's the uniform. 1239 00:58:19,320 --> 00:58:21,690 And there was this guy, which was a benchmark, which 1240 00:58:21,690 --> 00:58:24,792 was the actual benchmark that we need to have for this problem. 1241 00:58:24,792 --> 00:58:27,000 And those things basically came out of my hat, right? 1242 00:58:27,000 --> 00:58:29,230 Those are numbers that exist. 1243 00:58:29,230 --> 00:58:33,120 But in practice, you actually make those numbers yourself. 1244 00:58:33,120 --> 00:58:36,360 And the way you do it is by saying, well, 1245 00:58:36,360 --> 00:58:39,760 if I have a binomial distribution 1246 00:58:39,760 --> 00:58:41,350 and I want to test if my data comes 1247 00:58:41,350 --> 00:58:42,969 from a binomial distribution, you 1248 00:58:42,969 --> 00:58:44,260 could ask this question, right? 1249 00:58:44,260 --> 00:58:45,580 You have a bunch of data. 1250 00:58:45,580 --> 00:58:48,070 I did not promise to you that this 1251 00:58:48,070 --> 00:58:50,920 was the sum of independent Bernoullis and [INAUDIBLE].. 1252 00:58:50,920 --> 00:58:53,800 And then you can actually check that it's 1253 00:58:53,800 --> 00:58:55,030 indeed a binomial. 1254 00:58:55,030 --> 00:58:57,580 If you think about where you've encountered binomials, 1255 00:58:57,580 --> 00:58:59,380 it was mostly when you were drawing balls 1256 00:58:59,380 --> 00:59:02,490 from urns, which you probably don't do that much in practice. 1257 00:59:02,490 --> 00:59:02,990 OK? 1258 00:59:02,990 --> 00:59:05,639 And so maybe one day you want to model things as a binomial, 1259 00:59:05,639 --> 00:59:07,430 or maybe you want to model it as a Poisson, 1260 00:59:07,430 --> 00:59:08,800 as a limiting binomial, right? 1261 00:59:08,800 --> 00:59:11,380 People tell you photons arrive-- 1262 00:59:11,380 --> 00:59:13,510 the number of photons hitting some surface 1263 00:59:13,510 --> 00:59:15,460 actually has a Poisson distribution, right? 1264 00:59:15,460 --> 00:59:18,330 That's where they arise a lot in imaging. 1265 00:59:18,330 --> 00:59:21,100 So I have a colleague who's taking pictures 1266 00:59:21,100 --> 00:59:23,714 of the sky at night, and he's like following stars, 1267 00:59:23,714 --> 00:59:26,380 just moving around with the rotation of the Earth. 1268 00:59:26,380 --> 00:59:28,637 And he has to do this for like eight hours 1269 00:59:28,637 --> 00:59:30,970 because he needs to get enough photons for this picture 1270 00:59:30,970 --> 00:59:32,150 to actually arise. 1271 00:59:32,150 --> 00:59:35,515 And he knows they arrive like a Poisson process, 1272 00:59:35,515 --> 00:59:39,830 and you know, chapter 7 of your probability class, I guess. 1273 00:59:39,830 --> 00:59:40,650 And there's 1274 00:59:40,650 --> 00:59:43,330 all these distributions 1275 00:59:43,330 --> 00:59:44,890 outside the classroom that you probably 1276 00:59:44,890 --> 00:59:46,724 want to check are actually correct. 1277 00:59:46,724 --> 00:59:49,139 And so the first one you might want to check, for example, 1278 00:59:49,139 --> 00:59:49,725 is a binomial. 1279 00:59:49,725 --> 00:59:52,540 So I give you a distribution, a binomial distribution 1280 00:59:52,540 --> 00:59:56,540 on, say, K trials, and you have some number p.
1281 00:59:56,540 --> 00:59:59,140 And here, I don't know typically what p should be, 1282 00:59:59,140 --> 01:00:01,824 but let's say I know it or estimate it from my data. 1283 01:00:01,824 --> 01:00:04,240 And here, since we're only going to deal with asymptotics, 1284 01:00:04,240 --> 01:00:07,000 just like it was the case for the Kolmogorov-Smirnov one, 1285 01:00:07,000 --> 01:00:08,860 in the asymptotics we're going to be 1286 01:00:08,860 --> 01:00:13,086 able to think of the estimated p as being the true p, OK, 1287 01:00:13,086 --> 01:00:15,340 under the null at least. 1288 01:00:15,340 --> 01:00:19,180 So therefore, for each outcome, I can actually tell you what 1289 01:00:19,180 --> 01:00:20,590 the probability that a binomial 1290 01:00:20,590 --> 01:00:21,340 takes this outcome is. 1291 01:00:21,340 --> 01:00:23,920 For a given K and a given p, I can tell you 1292 01:00:23,920 --> 01:00:25,690 exactly what a binomial should give you 1293 01:00:25,690 --> 01:00:27,800 as the probability for the outcome. 1294 01:00:27,800 --> 01:00:33,670 And that's what I actually use to replace the numbers 1/12, 1295 01:00:33,670 --> 01:00:41,290 1/12, 1/12, 1/12 or the numbers 0.72, 0.07, 0.12, 0.09. 1296 01:00:41,290 --> 01:00:43,420 All these numbers I can actually compute 1297 01:00:43,420 --> 01:00:45,640 using the probabilities of a binomial, right? 1298 01:00:45,640 --> 01:00:52,600 So I know, for example, that the probability that a binomial np 1299 01:00:52,600 --> 01:01:02,830 is equal to, say, K is n choose K p to the K 1 minus p 1300 01:01:02,830 --> 01:01:05,895 to the n minus K. OK? 1301 01:01:05,895 --> 01:01:07,300 I mean, so these are numbers. 1302 01:01:07,300 --> 01:01:08,800 If you give me p and you give me n, 1303 01:01:08,800 --> 01:01:12,710 I can compute those numbers for all K from 0 to n. 1304 01:01:12,710 --> 01:01:14,604 And from this I can actually build a table. 1305 01:01:22,060 --> 01:01:22,560 All right? 1306 01:01:22,560 --> 01:01:25,600 So for each K-- 1307 01:01:25,600 --> 01:01:26,340 0. 1308 01:01:26,340 --> 01:01:31,020 So K is here, and from 0, 1, et cetera, 1309 01:01:31,020 --> 01:01:35,640 all the way to n, I can compute the true probability, which 1310 01:01:35,640 --> 01:01:40,680 is the probability that my binomial np is equal to 0, 1311 01:01:40,680 --> 01:01:45,130 the probability that my binomial is equal to 1, et cetera, 1312 01:01:45,130 --> 01:01:46,440 all the way to n. 1313 01:01:46,440 --> 01:01:47,610 I can compute those numbers. 1314 01:01:47,610 --> 01:01:50,560 Those are actually going to be exact numbers, right? 1315 01:01:50,560 --> 01:01:52,952 I just plug in the formula that I had. 1316 01:01:52,952 --> 01:01:54,660 And then I'm going to have some observed probabilities. 1317 01:02:01,900 --> 01:02:05,460 So that's going to be p hat, 0, and that's basically 1318 01:02:05,460 --> 01:02:12,430 the proportion of 0's, right? 1319 01:02:12,430 --> 01:02:16,542 So here you have to remember it's not a one-time experiment 1320 01:02:16,542 --> 01:02:18,250 like you do in probability where you say, 1321 01:02:18,250 --> 01:02:22,390 I'm going to draw n balls from an urn, 1322 01:02:22,390 --> 01:02:24,100 and I'm counting how many-- 1323 01:02:24,100 --> 01:02:25,150 how many I have. 1324 01:02:25,150 --> 01:02:25,990 This is statistics. 1325 01:02:25,990 --> 01:02:28,990 I need to be able to do this experiment many times 1326 01:02:28,990 --> 01:02:31,910 so I can actually, in the end, get an idea of what 1327 01:02:31,910 --> 01:02:33,810 the proportions, the p's, are.
1328 01:02:33,810 --> 01:02:36,100 So you have not just one binomial, 1329 01:02:36,100 --> 01:02:38,300 but you have n binomials. 1330 01:02:38,300 --> 01:02:40,380 Well, maybe I should not use n twice. 1331 01:02:40,380 --> 01:02:42,080 So that's why it's the K here, right? 1332 01:02:42,080 --> 01:02:44,140 So I have a binomial [INAUDIBLE] at Kp 1333 01:02:44,140 --> 01:02:46,405 and I just see n of those guys. 1334 01:02:46,405 --> 01:02:48,280 And with these n guys, I can actually 1335 01:02:48,280 --> 01:02:50,072 estimate those probabilities. 1336 01:02:50,072 --> 01:02:51,530 And what I'm going to want to check 1337 01:02:51,530 --> 01:02:53,280 is if those two probabilities are actually 1338 01:02:53,280 --> 01:02:54,520 close to each other. 1339 01:02:54,520 --> 01:02:57,980 But I already know how to do this. 1340 01:02:57,980 --> 01:02:58,480 All right? 1341 01:02:58,480 --> 01:03:00,130 So here I'm going to test whether P 1342 01:03:00,130 --> 01:03:02,810 is in some parametric family, for example, 1343 01:03:02,810 --> 01:03:06,700 binomial or not binomial. 1344 01:03:06,700 --> 01:03:09,630 And testing-- if I know that it's a binomial [INAUDIBLE],, 1345 01:03:09,630 --> 01:03:12,870 and I basically just have to test if P is the right thing. 1346 01:03:12,870 --> 01:03:14,460 OK? 1347 01:03:14,460 --> 01:03:17,710 Oh, sorry, I'm actually lying to you here. 1348 01:03:17,710 --> 01:03:18,210 OK. 1349 01:03:18,210 --> 01:03:19,793 I don't want to test if it's binomial. 1350 01:03:19,793 --> 01:03:24,220 I want to test the parameter of the binomial here. 1351 01:03:24,220 --> 01:03:24,840 OK? 1352 01:03:24,840 --> 01:03:28,330 So I know-- no, sorry, [INAUDIBLE] sorry. 1353 01:03:28,330 --> 01:03:28,830 OK. 1354 01:03:28,830 --> 01:03:30,960 So I want to know if I'm in some family, 1355 01:03:30,960 --> 01:03:34,380 the family of binomials, or not in the family of binomials. 1356 01:03:34,380 --> 01:03:35,280 OK? 1357 01:03:35,280 --> 01:03:36,910 Well, that's what I want to do. 1358 01:03:36,910 --> 01:03:39,690 And so here H0 is basically equivalent to testing 1359 01:03:39,690 --> 01:03:42,750 if the pj's are the pj's that come from the binomial. 1360 01:03:42,750 --> 01:03:46,170 And the pj's here are the probabilities that I get. 1361 01:03:46,170 --> 01:03:50,180 This is the probability that I get j successes. 1362 01:03:50,180 --> 01:03:51,180 That's my pj. 1363 01:03:51,180 --> 01:03:54,370 That's the j-th value here. 1364 01:03:54,370 --> 01:03:54,870 OK? 1365 01:03:54,870 --> 01:03:57,600 So this is the example, and we know how to do this. 1366 01:03:57,600 --> 01:04:00,290 We construct p hat, which is the estimated 1367 01:04:00,290 --> 01:04:03,230 proportion of successes from the observations. 1368 01:04:03,230 --> 01:04:05,750 So here now I have n trials. 1369 01:04:05,750 --> 01:04:08,390 This is the actual maximum likelihood estimator. 1370 01:04:08,390 --> 01:04:12,230 This becomes a multinomial experiment, right? 1371 01:04:12,230 --> 01:04:13,430 So it's kind of confusing. 1372 01:04:13,430 --> 01:04:17,010 We have a multinomial experiment for a binomial distribution. 1373 01:04:17,010 --> 01:04:19,520 The binomial here is just a recipe 1374 01:04:19,520 --> 01:04:21,740 to create some test probabilities. 1375 01:04:21,740 --> 01:04:22,654 That's all it is. 1376 01:04:22,654 --> 01:04:24,320 The binomial here doesn't really matter. 1377 01:04:24,320 --> 01:04:26,539 It's really to create the test probabilities.
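As a sketch of how that table gets filled in, with a hypothetical sample of counts and dbinom() supplying the binomial probabilities:

    # True vs. observed probabilities for a Binomial(K, p) fit (illustration only)
    K <- 5
    y <- rbinom(200, size = K, prob = 0.3)   # n = 200 hypothetical observations
    n <- length(y)
    p.hat  <- mean(y) / K                    # MLE of p from the counts
    true.p <- dbinom(0:K, size = K, prob = p.hat)    # row of true probabilities
    obs.p  <- tabulate(y + 1, nbins = K + 1) / n     # observed proportion of each value 0..K
    rbind(true = true.p, observed = obs.p)

The statistic defined next is built from exactly these two rows.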
1378 01:04:26,539 --> 01:04:28,830 And then I'm going to define this test statistic, which 1379 01:04:28,830 --> 01:04:36,420 is known as the chi-square statistic, right? 1380 01:04:36,420 --> 01:04:37,910 This was the chi-square test. 1381 01:04:37,910 --> 01:04:41,490 We just looked at the sum of the squares of the differences. 1382 01:04:41,490 --> 01:04:45,004 Inverting the covariance matrix-- or using the Fisher information, 1383 01:04:45,004 --> 01:04:46,920 removing the part that was not invertible-- 1384 01:04:46,920 --> 01:04:50,410 led us to actually use this particular value here, 1385 01:04:50,410 --> 01:04:54,325 and then we had to multiply by n. 1386 01:04:54,325 --> 01:04:55,040 OK? 1387 01:04:55,040 --> 01:04:59,710 And that, we know, converges to what? 1388 01:04:59,710 --> 01:05:01,510 A chi-square distribution. 1389 01:05:01,510 --> 01:05:03,260 So I'm not going to go through this again. 1390 01:05:03,260 --> 01:05:05,218 I'm just telling you you can use the chi-square 1391 01:05:05,218 --> 01:05:08,090 that we've seen, where we just came up with the numbers we 1392 01:05:08,090 --> 01:05:09,020 were testing. 1393 01:05:09,020 --> 01:05:12,350 Those numbers that were in this row for the true probabilities, 1394 01:05:12,350 --> 01:05:14,224 we came up with them out of thin air. 1395 01:05:14,224 --> 01:05:15,890 And now I'm telling you you can actually 1396 01:05:15,890 --> 01:05:19,010 come up with those guys from a binomial distribution 1397 01:05:19,010 --> 01:05:20,905 or a Poisson distribution or whatever 1398 01:05:20,905 --> 01:05:22,196 distribution you're happy with. 1399 01:05:26,004 --> 01:05:26,956 Any question? 1400 01:05:30,300 --> 01:05:31,970 So now I'm creating this thing, and I 1401 01:05:31,970 --> 01:05:34,790 can apply the entire theory that I have for the chi-square 1402 01:05:34,790 --> 01:05:36,710 and, in particular, that this thing converges 1403 01:05:36,710 --> 01:05:38,846 to a chi-square. 1404 01:05:38,846 --> 01:05:40,970 But if you see, there's something that's different. 1405 01:05:40,970 --> 01:05:42,186 What is different? 1406 01:05:45,640 --> 01:05:47,850 The degrees of freedom. 1407 01:05:47,850 --> 01:05:51,990 And if you think about it, again, the meaning of degrees 1408 01:05:51,990 --> 01:05:52,510 of freedom. 1409 01:05:52,510 --> 01:05:54,020 What does this word-- 1410 01:05:54,020 --> 01:05:55,810 these words actually mean? 1411 01:05:55,810 --> 01:05:57,960 It means, well, to which extent can I 1412 01:05:57,960 --> 01:05:59,340 play around with those values? 1413 01:05:59,340 --> 01:06:01,230 What are the possible values that I can get? 1414 01:06:01,230 --> 01:06:03,990 If I'm not equal to this particular value I'm testing, 1415 01:06:03,990 --> 01:06:07,500 how many directions can I be different from this guy? 1416 01:06:07,500 --> 01:06:10,650 And when we had a given set of values, 1417 01:06:10,650 --> 01:06:13,170 we could be any other set of values, right? 1418 01:06:13,170 --> 01:06:16,140 So here, I had this-- 1419 01:06:16,140 --> 01:06:19,890 I'm going to represent-- this is the set of all probability 1420 01:06:19,890 --> 01:06:23,910 distributions of vectors of size K.
So here, 1421 01:06:23,910 --> 01:06:25,830 if I look at one point in this set, 1422 01:06:25,830 --> 01:06:29,530 this is something that looks like p1 through pK such that 1423 01:06:29,530 --> 01:06:30,360 their sum-- 1424 01:06:30,360 --> 01:06:36,520 such that they're non-negative, and the sum p1 through pK 1425 01:06:36,520 --> 01:06:37,190 is equal to 1. 1426 01:06:37,190 --> 01:06:37,690 OK? 1427 01:06:37,690 --> 01:06:40,000 So I have all those points here. 1428 01:06:40,000 --> 01:06:41,900 OK? 1429 01:06:41,900 --> 01:06:44,930 So this is basically the set that I had before. 1430 01:06:44,930 --> 01:06:47,210 I was testing whether I was equal to this one guy, 1431 01:06:47,210 --> 01:06:48,980 or if I was anything else. 1432 01:06:48,980 --> 01:06:51,157 And there's many ways I can be anything else. 1433 01:06:51,157 --> 01:06:53,240 What matters, of course, is what's around this guy 1434 01:06:53,240 --> 01:06:55,970 that I could actually confuse myself with. 1435 01:06:55,970 --> 01:06:58,050 But there's many ways I can move around this guy. 1436 01:06:58,050 --> 01:07:00,670 Agreed? 1437 01:07:00,670 --> 01:07:04,710 Now I'm actually just testing something very specific. 1438 01:07:04,710 --> 01:07:06,840 I'm saying, well, now the p's that I 1439 01:07:06,840 --> 01:07:09,180 have had to come from this-- have 1440 01:07:09,180 --> 01:07:13,560 to be constructed from this formula, this parametric family 1441 01:07:13,560 --> 01:07:14,840 P of theta. 1442 01:07:14,840 --> 01:07:20,130 And there's a fixed way for-- let's say this is theta, 1443 01:07:20,130 --> 01:07:23,340 so I have a theta here. 1444 01:07:23,340 --> 01:07:26,150 There's not that many ways this can actually give me 1445 01:07:26,150 --> 01:07:28,430 a set of probabilities, right? 1446 01:07:28,430 --> 01:07:31,110 I have to move to another theta to actually start 1447 01:07:31,110 --> 01:07:32,510 being confused. 1448 01:07:32,510 --> 01:07:34,940 And so here the number of degrees of freedom 1449 01:07:34,940 --> 01:07:39,200 is basically, how can I move along this family? 1450 01:07:39,200 --> 01:07:41,630 And so here, this is all the points, 1451 01:07:41,630 --> 01:07:43,160 but there might be just the subset 1452 01:07:43,160 --> 01:07:45,750 of the points that looks like this, just this curve, 1453 01:07:45,750 --> 01:07:48,680 not the whole of this thing. 1454 01:07:48,680 --> 01:07:56,210 And those guys on this curve are the p thetas, 1455 01:07:56,210 --> 01:08:00,020 and that's for all thetas, when theta runs across capital Theta. 1456 01:08:00,020 --> 01:08:03,060 So in a way, this is just a much smaller dimensional thing. 1457 01:08:03,060 --> 01:08:04,700 It's a much smaller object. 1458 01:08:04,700 --> 01:08:06,860 Those are only the ones that I can 1459 01:08:06,860 --> 01:08:13,100 create that are exactly of this very specific parametric form. 1460 01:08:13,100 --> 01:08:15,410 And of course, not all are of this form. 1461 01:08:15,410 --> 01:08:19,270 Not all PMFs are of this form. 1462 01:08:19,270 --> 01:08:20,939 And so that is going to have an effect 1463 01:08:20,939 --> 01:08:24,060 on what my PMF is going to be-- 1464 01:08:24,060 --> 01:08:28,830 sorry, on what my-- 1465 01:08:28,830 --> 01:08:33,689 sorry, what my degrees of freedom are going to be.
1466 01:08:33,689 --> 01:08:39,149 Because when this thing is very small, that means when-- 1467 01:08:39,149 --> 01:08:41,170 that's happening when theta is actually, 1468 01:08:41,170 --> 01:08:44,670 say, a one-dimensional space, then there's still 1469 01:08:44,670 --> 01:08:46,470 many ways I can escape, right? 1470 01:08:46,470 --> 01:08:48,450 I can be different from this guy in pretty 1471 01:08:48,450 --> 01:08:50,939 much every other direction, except for those two 1472 01:08:50,939 --> 01:08:53,910 directions, just when I move from here 1473 01:08:53,910 --> 01:08:56,050 or when I move in this direction. 1474 01:08:56,050 --> 01:09:00,120 But now if this thing becomes bigger, 1475 01:09:00,120 --> 01:09:03,399 your theta is, say, two dimensional, 1476 01:09:03,399 --> 01:09:06,090 then when I'm here it's becoming harder 1477 01:09:06,090 --> 01:09:07,229 for me to not be that guy. 1478 01:09:07,229 --> 01:09:08,812 If I want to move away from it, then I 1479 01:09:08,812 --> 01:09:11,460 have to move away from the board. 1480 01:09:11,460 --> 01:09:15,018 And so that means that the bigger the dimension 1481 01:09:15,018 --> 01:09:18,590 of my theta, the smaller the degrees of freedom 1482 01:09:18,590 --> 01:09:24,810 that I have, OK, because moving out of this parametric family 1483 01:09:24,810 --> 01:09:27,490 is actually very difficult for me. 1484 01:09:27,490 --> 01:09:30,930 So if you think, for example, as an extreme case, 1485 01:09:30,930 --> 01:09:36,580 the parametric family that I have is basically all PMFs, 1486 01:09:36,580 --> 01:09:38,069 all of them, right? 1487 01:09:38,069 --> 01:09:39,710 So that's a stupid parametric family. 1488 01:09:39,710 --> 01:09:41,890 I'm indexed by the distribution itself, 1489 01:09:41,890 --> 01:09:43,810 but it's still finite dimensional. 1490 01:09:43,810 --> 01:09:46,810 Then here, I have basically no degrees of freedom. 1491 01:09:46,810 --> 01:09:48,220 There's no way I can actually not 1492 01:09:48,220 --> 01:09:51,250 be that guy, because this is everything I have. 1493 01:09:51,250 --> 01:09:54,220 And so you don't really have to understand 1494 01:09:54,220 --> 01:09:59,050 how the computation comes out of these notions of dimension 1495 01:09:59,050 --> 01:10:01,300 and what I mean by the dimension of this curved space. 1496 01:10:01,300 --> 01:10:05,170 But really, what's important is that as the dimension of theta 1497 01:10:05,170 --> 01:10:09,350 becomes bigger, I have less degrees of freedom 1498 01:10:09,350 --> 01:10:11,640 to be away from this family. 1499 01:10:11,640 --> 01:10:13,730 This family becomes big, and it's very hard for me 1500 01:10:13,730 --> 01:10:14,990 to violate this. 1501 01:10:14,990 --> 01:10:17,210 So it's actually shrinking the number of degrees 1502 01:10:17,210 --> 01:10:18,907 of freedom of my chi-square. 1503 01:10:18,907 --> 01:10:20,490 And that's all you need to understand. 1504 01:10:20,490 --> 01:10:23,240 When d increases, the number of degrees of freedom decreases. 1505 01:10:23,240 --> 01:10:27,304 And I'd like you to have an idea of why this is somewhat 1506 01:10:27,304 --> 01:10:28,928 true, and this is basically the picture 1507 01:10:28,928 --> 01:10:30,068 you should have in mind. 1508 01:10:33,240 --> 01:10:33,740 OK. 1509 01:10:33,740 --> 01:10:35,920 So now once I have done this, I can just construct. 1510 01:10:35,920 --> 01:10:37,290 So here I need to check. 1511 01:10:37,290 --> 01:10:39,178 So what is d in the case of the binomial?
1512 01:10:42,590 --> 01:10:43,090 AUDIENCE: 1. 1513 01:10:43,090 --> 01:10:43,570 PHILIPPE RIGOLLET: 1, right? 1514 01:10:43,570 --> 01:10:44,980 It's just a one-dimensional thing. 1515 01:10:44,980 --> 01:10:46,396 And for most of the examples we're 1516 01:10:46,396 --> 01:10:48,440 going to have it's going to be one dimensional. 1517 01:10:48,440 --> 01:10:49,360 So we have this weird thing. 1518 01:10:49,360 --> 01:10:51,430 We're going to have K minus 2 degrees of freedom. 1519 01:10:54,580 --> 01:10:59,640 So now I have this thing, and I have this asymptotic. 1520 01:10:59,640 --> 01:11:02,310 And then I can just basically use a test that has-- 1521 01:11:02,310 --> 01:11:04,610 that uses the fact that the asymptotic distribution 1522 01:11:04,610 --> 01:11:05,110 is this. 1523 01:11:05,110 --> 01:11:06,870 So I compute my quantiles out of this. 1524 01:11:06,870 --> 01:11:08,210 Again, I made the same mistake. 1525 01:11:08,210 --> 01:11:11,490 This should be q alpha, and this should be q alpha. 1526 01:11:11,490 --> 01:11:13,110 So that's just the tail probability 1527 01:11:13,110 --> 01:11:16,699 is equal to alpha when I'm on the right of q alpha. 1528 01:11:16,699 --> 01:11:18,240 And so those are the tail probability 1529 01:11:18,240 --> 01:11:20,730 of the appropriate chi-square with the appropriate number 1530 01:11:20,730 --> 01:11:22,030 of degrees of freedom. 1531 01:11:22,030 --> 01:11:24,880 And so I can compute p-values, and I can do whatever I want. 1532 01:11:24,880 --> 01:11:25,380 OK? 1533 01:11:25,380 --> 01:11:28,510 So then I just like [INAUDIBLE] my testing machinery. 1534 01:11:28,510 --> 01:11:29,010 OK? 1535 01:11:29,010 --> 01:11:34,960 So now I know how to test if I'm a binomial distribution or not. 1536 01:11:34,960 --> 01:11:38,080 Again here, testing if I'm a binomial distribution 1537 01:11:38,080 --> 01:11:40,660 is not a simple goodness of fit. 1538 01:11:40,660 --> 01:11:43,040 It's a composite one where I can actually-- 1539 01:11:43,040 --> 01:11:45,910 there's many ways I can be a binomial distribution 1540 01:11:45,910 --> 01:11:48,260 because there's as many as there is theta. 1541 01:11:48,260 --> 01:11:51,700 And so I'm actually plugging in the theta hat, which is 1542 01:11:51,700 --> 01:11:54,380 estimated from the data, right? 1543 01:11:54,380 --> 01:11:57,370 And here, since everything's happening in the asymptotics, 1544 01:11:57,370 --> 01:12:00,790 I'm not claiming that Tn has a pivotal distribution 1545 01:12:00,790 --> 01:12:01,849 for finite n. 1546 01:12:01,849 --> 01:12:02,890 That's actually not true. 1547 01:12:02,890 --> 01:12:04,514 It's going to depend like crazy on what 1548 01:12:04,514 --> 01:12:06,150 the actual distribution is. 1549 01:12:06,150 --> 01:12:08,170 But asymptotically, I have a chi-square, 1550 01:12:08,170 --> 01:12:11,539 which obviously does not depend on anything [INAUDIBLE].. 1551 01:12:11,539 --> 01:12:13,511 OK? 1552 01:12:13,511 --> 01:12:14,497 Yeah? 1553 01:12:14,497 --> 01:12:19,920 AUDIENCE: So in general, for the binomial [INAUDIBLE] trials. 1554 01:12:19,920 --> 01:12:23,371 But in the general case, the number of-- 1555 01:12:23,371 --> 01:12:26,315 the size of our PMF is the number of [INAUDIBLE].. 1556 01:12:26,315 --> 01:12:27,315 PHILIPPE RIGOLLET: Yeah. 1557 01:12:27,315 --> 01:12:29,287 AUDIENCE: So let's say that I was also 1558 01:12:29,287 --> 01:12:32,738 uncertain about what K was so that I don't 1559 01:12:32,738 --> 01:12:37,668 know how big my [INAUDIBLE] is. 
1560 01:12:37,668 --> 01:12:48,580 [INAUDIBLE] 1561 01:12:48,580 --> 01:12:50,090 PHILIPPE RIGOLLET: That is correct. 1562 01:12:50,090 --> 01:12:54,670 And thank you for this beautiful segue into my next slide. 1563 01:12:54,670 --> 01:12:56,290 So we can actually deal with the case 1564 01:12:56,290 --> 01:12:57,640 not only where it's infinite, which 1565 01:12:57,640 --> 01:12:58,870 would be the case of Poisson. 1566 01:12:58,870 --> 01:13:00,244 I mean, nobody believes I'm going 1567 01:13:00,244 --> 01:13:02,620 to get an infinite number of photons 1568 01:13:02,620 --> 01:13:04,210 in a finite amount of time. 1569 01:13:04,210 --> 01:13:08,140 But we just don't want to have to say there's got to be a-- 1570 01:13:08,140 --> 01:13:09,910 this is the largest possible number. 1571 01:13:09,910 --> 01:13:10,870 We don't want to have to do that. 1572 01:13:10,870 --> 01:13:13,078 Because if you start doing this and the probabilities 1573 01:13:13,078 --> 01:13:16,370 become close to 0, things become degenerate and it's an issue. 1574 01:13:16,370 --> 01:13:18,220 So what we do is we bin. 1575 01:13:18,220 --> 01:13:19,890 We just bin stuff. 1576 01:13:19,890 --> 01:13:20,550 OK? 1577 01:13:20,550 --> 01:13:23,860 And so maybe if I have a binomial distribution 1578 01:13:23,860 --> 01:13:28,400 with, say, 200,000 possible values, 1579 01:13:28,400 --> 01:13:32,082 then it's actually maybe not the level of precision 1580 01:13:32,082 --> 01:13:33,040 I want to look at this. 1581 01:13:33,040 --> 01:13:33,870 Maybe I want to bin. 1582 01:13:33,870 --> 01:13:35,411 Maybe I want to say, let's just think 1583 01:13:35,411 --> 01:13:37,450 of all things that are between 0 and 100 1584 01:13:37,450 --> 01:13:40,765 to be the same thing, between 100 and 200 the same thing, 1585 01:13:40,765 --> 01:13:41,710 et cetera. 1586 01:13:41,710 --> 01:13:44,064 And so in fact, I'm actually going to bin. 1587 01:13:44,064 --> 01:13:46,480 I don't even have to think about things that are discrete. 1588 01:13:46,480 --> 01:13:49,120 I can even think about continuous cases. 1589 01:13:49,120 --> 01:13:51,850 And so if I want to test if I have a Gaussian distribution, 1590 01:13:51,850 --> 01:13:55,420 for example, I can just approximate that by some, 1591 01:13:55,420 --> 01:13:59,590 say, piecewise constant function that just says that, 1592 01:13:59,590 --> 01:14:03,370 well, if I have a Gaussian distribution like this, 1593 01:14:03,370 --> 01:14:06,484 I'm going to bin it like this. 1594 01:14:06,484 --> 01:14:08,650 And I'm going to say, well, the probability that I'm 1595 01:14:08,650 --> 01:14:10,150 less than this value is this. 1596 01:14:10,150 --> 01:14:12,692 The probability that I'm between this and this value is this. 1597 01:14:12,692 --> 01:14:14,650 The probability I'm between this and this value 1598 01:14:14,650 --> 01:14:18,370 is this, and then this and then this, right? 1599 01:14:18,370 --> 01:14:19,650 And now I've turned-- 1600 01:14:19,650 --> 01:14:24,240 I've discretized, effectively, my Gaussian into a PMF. 1601 01:14:24,240 --> 01:14:26,140 The value-- this is p1. 1602 01:14:26,140 --> 01:14:28,510 The value here is p1. 1603 01:14:28,510 --> 01:14:30,460 This is p2. 1604 01:14:30,460 --> 01:14:32,800 This is p3. 1605 01:14:32,800 --> 01:14:35,230 This is p4. 1606 01:14:35,230 --> 01:14:39,150 This is p5 and p6, right? 1607 01:14:39,150 --> 01:14:41,920 I have discretized my Gaussian into six possible values. 
1608 01:14:41,920 --> 01:14:46,650 That's just the probability that they fall into a certain bin. 1609 01:14:46,650 --> 01:14:47,865 And we can do this-- 1610 01:14:47,865 --> 01:14:51,590 if you don't know what K is, just stop at 10. 1611 01:14:51,590 --> 01:14:54,360 You look at your data quickly and you say, well, you know, 1612 01:14:54,360 --> 01:15:00,180 I have so few of them that are-- like I see maybe one 8, one 11, 1613 01:15:00,180 --> 01:15:01,590 and one 15. 1614 01:15:01,590 --> 01:15:03,270 Well, everything that's between 8 and 20, 1615 01:15:03,270 --> 01:15:05,130 I'm just going to put in one bin. 1616 01:15:05,130 --> 01:15:07,020 Because what else are you going to do? 1617 01:15:07,020 --> 01:15:09,490 I mean, you just don't have enough observations. 1618 01:15:09,490 --> 01:15:11,710 And so what we do is we just bin everything. 1619 01:15:11,710 --> 01:15:14,460 So here I'm going to actually be slightly abstract. 1620 01:15:14,460 --> 01:15:16,922 Our bins are going to be intervals Aj. 1621 01:15:16,922 --> 01:15:18,880 So here-- they don't even have to be intervals. 1622 01:15:18,880 --> 01:15:21,930 I could go crazy and just call the bin this guy 1623 01:15:21,930 --> 01:15:23,370 and this guy, right? 1624 01:15:23,370 --> 01:15:27,110 That would make no sense, but I could do that. 1625 01:15:27,110 --> 01:15:30,620 And then I'm-- and of course, you can do whatever you want, 1626 01:15:30,620 --> 01:15:33,180 but there are going to be some consequences for the conclusions 1627 01:15:33,180 --> 01:15:34,490 that you can draw, right? 1628 01:15:34,490 --> 01:15:35,906 All you're going to be able to say 1629 01:15:35,906 --> 01:15:38,790 is that my distribution does not look like it 1630 01:15:38,790 --> 01:15:40,800 could be binned in this way. 1631 01:15:40,800 --> 01:15:42,570 That's all you're going to be able to say. 1632 01:15:42,570 --> 01:15:46,800 So if you decide to just put all the negative numbers in one bin 1633 01:15:46,800 --> 01:15:48,357 and all the positive numbers in another, then it's 1634 01:15:48,357 --> 01:15:50,190 going to be very hard for you to distinguish 1635 01:15:50,190 --> 01:15:52,314 a Gaussian from a random variable that takes values 1636 01:15:52,314 --> 01:15:54,110 of minus 1 and plus 1 only. 1637 01:15:54,110 --> 01:15:57,490 You need to just be reasonable. 1638 01:15:57,490 --> 01:15:57,990 OK? 1639 01:15:57,990 --> 01:16:00,720 So now my pj's become the probability 1640 01:16:00,720 --> 01:16:02,590 that my random variable falls into bin j. 1641 01:16:06,600 --> 01:16:10,290 So that's pj of theta under the parametric distribution. 1642 01:16:10,290 --> 01:16:14,270 For the true one, whether it's parametric or not, I have a pj. 1643 01:16:14,270 --> 01:16:15,870 And then I have p hat j, which is 1644 01:16:15,870 --> 01:16:19,030 the proportion of observations that fall in this bin. 1645 01:16:19,030 --> 01:16:19,530 All right? 1646 01:16:19,530 --> 01:16:21,030 So I have a bunch of observations. 1647 01:16:21,030 --> 01:16:23,250 I count how many of them fall in this bin. 1648 01:16:23,250 --> 01:16:26,130 I divide by n, and that tells me what my estimated 1649 01:16:26,130 --> 01:16:29,410 probability for this bin is. 1650 01:16:29,410 --> 01:16:31,444 And theta hat, well, it's the same as before.
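[The empirical side is just counting. As a sketch, with the same hypothetical bin edges as above and a made-up sample:]

```python
# \hat p_j: the fraction of observations landing in bin j.
import numpy as np

x = np.random.randn(200)                                        # hypothetical sample
edges = np.array([-np.inf, -2.0, -1.0, 0.0, 1.0, 2.0, np.inf])

counts, _ = np.histogram(x, bins=edges)     # how many observations fall in each bin
p_hat = counts / len(x)                     # estimated bin probabilities
```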
1651 01:16:31,444 --> 01:16:32,860 If I'm in a parametric family, I'm 1652 01:16:32,860 --> 01:16:35,151 just estimating theta hat, maybe with the maximum likelihood 1653 01:16:35,151 --> 01:16:37,690 estimator, plugging it in, and estimating 1654 01:16:37,690 --> 01:16:39,700 those pj's of theta hat. 1655 01:16:39,700 --> 01:16:43,390 From this, I form my chi-square, and I have exactly 1656 01:16:43,390 --> 01:16:45,230 the same thing as before. 1657 01:16:45,230 --> 01:16:48,680 So the answer to your question is, yes, you bin. 1658 01:16:48,680 --> 01:16:51,690 And it's the answer to even more questions. 1659 01:16:51,690 --> 01:16:53,390 That's how you can actually 1660 01:16:53,390 --> 01:16:56,420 use the chi-square test to test for normality. 1661 01:16:56,420 --> 01:16:58,850 Now here it's going to be slightly weaker, 1662 01:16:58,850 --> 01:17:00,800 because there's only an asymptotic theory, 1663 01:17:00,800 --> 01:17:03,920 whereas Kolmogorov-Smirnov and Kolmogorov-Lilliefors actually 1664 01:17:03,920 --> 01:17:06,230 work even for finite samples. 1665 01:17:06,230 --> 01:17:08,600 For the chi-square test, it's only asymptotic. 1666 01:17:08,600 --> 01:17:11,300 So you just pretend you actually know what the parameters are. 1667 01:17:11,300 --> 01:17:15,250 You just stuff in a theta hat-- a mu hat 1668 01:17:15,250 --> 01:17:16,670 and a sigma squared hat. 1669 01:17:16,670 --> 01:17:19,280 And you just cross your fingers 1670 01:17:19,280 --> 01:17:21,020 that n is large enough for everything 1671 01:17:21,020 --> 01:17:24,161 to have converged by the time you make your decision. 1672 01:17:24,161 --> 01:17:24,660 OK? 1673 01:17:24,660 --> 01:17:28,440 And then this is a copy/paste, with the same error actually 1674 01:17:28,440 --> 01:17:31,710 as the previous slide, where you just build your test based 1675 01:17:31,710 --> 01:17:34,560 on whether or not you exceed some quantile, 1676 01:17:34,560 --> 01:17:37,721 and you can also compute some p-value. 1677 01:17:37,721 --> 01:17:38,220 OK? 1678 01:17:38,220 --> 01:17:39,120 AUDIENCE: The error? 1679 01:17:39,120 --> 01:17:40,328 PHILIPPE RIGOLLET: I'm sorry? 1680 01:17:40,328 --> 01:17:41,559 AUDIENCE: What's the error? 1681 01:17:41,559 --> 01:17:43,100 PHILIPPE RIGOLLET: What is the error? 1682 01:17:43,100 --> 01:17:45,575 AUDIENCE: You said [INAUDIBLE] copy/paste [INAUDIBLE]. 1683 01:17:45,575 --> 01:17:47,450 PHILIPPE RIGOLLET: Oh, the error is that this 1684 01:17:47,450 --> 01:17:48,520 should be q alpha, right? 1685 01:17:48,520 --> 01:17:49,190 AUDIENCE: OK. 1686 01:17:49,190 --> 01:17:51,273 PHILIPPE RIGOLLET: I've been calling this q alpha. 1687 01:17:51,273 --> 01:17:53,459 I mean, that's my personal choice, 1688 01:17:53,459 --> 01:17:54,500 because I don't want to-- 1689 01:17:54,500 --> 01:17:55,820 I only use q alpha. 1690 01:17:55,820 --> 01:17:59,644 So I only use quantiles where alpha is to the right. 1691 01:17:59,644 --> 01:18:01,310 That's what statisticians do-- probabilists 1692 01:18:01,310 --> 01:18:02,970 would use this notation. 1693 01:18:07,041 --> 01:18:07,540 OK. 1694 01:18:07,540 --> 01:18:10,000 And so some questions, right? 1695 01:18:10,000 --> 01:18:11,820 So of course, in practice you're going 1696 01:18:11,820 --> 01:18:13,650 to have some issues, which translate into questions like, 1697 01:18:13,650 --> 01:18:16,010 well, how do you pick this guy, this K?
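[Putting the pieces together, here is a hedged sketch of the whole chi-square normality test just described: plug in mu hat and sigma squared hat, bin, and compare to a chi-square with K minus 1 minus 2 degrees of freedom. The function name, the bin edges, and the data are my own choices, and as the lecture stresses, the guarantee is only asymptotic.]

```python
import numpy as np
from scipy.stats import norm, chi2

def chisq_normality_test(x, edges, alpha=0.05):
    n = len(x)
    mu_hat, sigma_hat = x.mean(), x.std()        # plug-in estimates; x.std() uses the
                                                 # 1/n variance Sn, and d = 2 parameters
    p_theta = np.diff(norm.cdf(edges, loc=mu_hat, scale=sigma_hat))
    counts, _ = np.histogram(x, bins=edges)
    p_hat = counts / n
    T_n = n * np.sum((p_hat - p_theta) ** 2 / p_theta)
    df = len(p_theta) - 1 - 2                    # K bins, minus 1, minus 2 parameters
    p_value = chi2.sf(T_n, df)
    return T_n, p_value, p_value < alpha         # True means reject normality

x = np.random.randn(500)                                     # hypothetical sample
edges = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])    # 5 bins, so df = 2
print(chisq_normality_test(x, edges))
```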
1698 01:18:16,010 --> 01:18:17,610 So I gave you some sort of a-- 1699 01:18:17,610 --> 01:18:19,810 I mean, the way we discussed, right? 1700 01:18:19,810 --> 01:18:23,220 You have 8 and 10 and 20, and then it's ad hoc. 1701 01:18:23,220 --> 01:18:27,120 And so whether you want to stop K at 20 1702 01:18:27,120 --> 01:18:29,610 or you want to bin those guys is really up to you. 1703 01:18:29,610 --> 01:18:31,050 And there are going to be some considerations 1704 01:18:31,050 --> 01:18:32,591 about the particular problem at hand. 1705 01:18:32,591 --> 01:18:34,180 I mean, is it coarse-- too coarse 1706 01:18:34,180 --> 01:18:38,070 for your problem to decide that the observations between 8 1707 01:18:38,070 --> 01:18:39,644 and 20 are the same? 1708 01:18:39,644 --> 01:18:40,560 It's really up to you. 1709 01:18:40,560 --> 01:18:42,476 Maybe that's actually making a huge difference 1710 01:18:42,476 --> 01:18:45,420 in terms of what phenomenon you're looking at. 1711 01:18:45,420 --> 01:18:46,770 The choice of the bins, right? 1712 01:18:46,770 --> 01:18:48,450 So here there are actually some sorts 1713 01:18:48,450 --> 01:18:51,870 of rules, which are: don't use only one bin, 1714 01:18:51,870 --> 01:18:55,200 and don't make the bins too small-- 1715 01:18:55,200 --> 01:18:57,710 make sure there's at least one observation per bin, right? 1716 01:18:57,710 --> 01:18:59,010 And it's basically the same kind of rules 1717 01:18:59,010 --> 01:19:00,360 that you would use to build a histogram. 1718 01:19:00,360 --> 01:19:02,280 If you were to build a histogram for your data, 1719 01:19:02,280 --> 01:19:03,780 you still want to make sure that you 1720 01:19:03,780 --> 01:19:05,030 bin in an appropriate fashion. 1721 01:19:05,030 --> 01:19:05,530 OK? 1722 01:19:05,530 --> 01:19:08,052 And there are a bunch of rules of thumb. 1723 01:19:08,052 --> 01:19:09,510 Every time you ask someone, they're 1724 01:19:09,510 --> 01:19:11,176 going to have a different rule of thumb, 1725 01:19:11,176 --> 01:19:13,850 so just make your own. 1726 01:19:13,850 --> 01:19:17,580 And then there's the computation of pj 1727 01:19:17,580 --> 01:19:19,530 of theta, which might be a bit complicated 1728 01:19:19,530 --> 01:19:21,450 because, in this case, I would have 1729 01:19:21,450 --> 01:19:24,030 to integrate the Gaussian between this number 1730 01:19:24,030 --> 01:19:25,270 and this number. 1731 01:19:25,270 --> 01:19:27,120 So for this case, I could just say, well, 1732 01:19:27,120 --> 01:19:30,150 it's the difference of the CDF at that value and that value, 1733 01:19:30,150 --> 01:19:31,440 and then be happy with it. 1734 01:19:31,440 --> 01:19:33,606 But you can imagine that you have some slightly more 1735 01:19:33,606 --> 01:19:34,574 crazy distributions. 1736 01:19:34,574 --> 01:19:36,240 You're going to have to compute 1737 01:19:36,240 --> 01:19:39,630 some integrals that might be unpleasant to compute. 1738 01:19:39,630 --> 01:19:40,180 OK? 1739 01:19:40,180 --> 01:19:41,846 And in particular, I said it's the difference 1740 01:19:41,846 --> 01:19:44,680 of the PDF between that value and that value-- sorry, 1741 01:19:44,680 --> 01:19:47,722 the CDF between that value and that value. That is true. 1742 01:19:47,722 --> 01:19:49,180 But it's not like you actually have 1743 01:19:49,180 --> 01:19:52,480 tables that give you the CDF at any value you like, right?
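[For what it's worth, one common rule of thumb, not the lecture's, and everyone has their own as just noted, is to widen or merge bins until every expected count n times pj of theta hat is at least around 5. A tiny sketch with made-up numbers:]

```python
import numpy as np

def bins_ok(n, p_theta, min_expected=5.0):
    # Flag a binning whose expected counts n * p_j are too small for the
    # chi-square asymptotics to be trustworthy; the threshold 5 is just
    # one common convention, not a theorem.
    expected = n * np.asarray(p_theta)
    return expected, bool(np.all(expected >= min_expected))

expected, ok = bins_ok(200, [0.05, 0.25, 0.40, 0.25, 0.05])
print(expected, ok)   # smallest expected count is 10 here, so this binning passes
```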
1744 01:19:52,480 --> 01:19:54,445 You have to sort of-- 1745 01:19:54,445 --> 01:19:56,212 well, there might be tables to some degree, 1746 01:19:56,212 --> 01:19:58,420 but typically you are going to have to use a computer 1747 01:19:58,420 --> 01:20:01,050 to do that. 1748 01:20:01,050 --> 01:20:01,550 OK? 1749 01:20:01,550 --> 01:20:05,270 And so for example, you could do the Poisson. 1750 01:20:05,270 --> 01:20:07,489 If I had time, if I had more than one minute, 1751 01:20:07,489 --> 01:20:08,780 I would actually do it for you. 1752 01:20:08,780 --> 01:20:10,340 But it's basically the same. 1753 01:20:10,340 --> 01:20:12,560 With the Poisson, you are going to have an infinite tail, 1754 01:20:12,560 --> 01:20:14,018 and you just say, at some point I'm 1755 01:20:14,018 --> 01:20:16,560 going to cut off everything that's larger than some value. 1756 01:20:16,560 --> 01:20:17,060 All right? 1757 01:20:17,060 --> 01:20:20,727 So you can play around, right? 1758 01:20:20,727 --> 01:20:23,310 I say, well, if you have extra knowledge about what you expect 1759 01:20:23,310 --> 01:20:26,000 to see, maybe you can cut at a certain number 1760 01:20:26,000 --> 01:20:30,530 and then just fold all the largest values, from K minus 1 1761 01:20:30,530 --> 01:20:35,630 to infinity, so that you actually 1762 01:20:35,630 --> 01:20:37,891 have everything in one large bin. 1763 01:20:37,891 --> 01:20:38,390 OK? 1764 01:20:38,390 --> 01:20:39,980 That's the entire tail. 1765 01:20:39,980 --> 01:20:42,350 And that's the way people do it in insurance companies, 1766 01:20:42,350 --> 01:20:42,869 for example. 1767 01:20:42,869 --> 01:20:45,410 They assume that the number of accidents you're going to have 1768 01:20:45,410 --> 01:20:47,300 follows a Poisson distribution. 1769 01:20:47,300 --> 01:20:48,620 They have to fit it to you. 1770 01:20:48,620 --> 01:20:49,680 Or at least they have to fit it 1771 01:20:49,680 --> 01:20:52,970 to your pool of insured people. 1772 01:20:52,970 --> 01:20:56,390 So they just slice you up by your 1773 01:20:56,390 --> 01:20:58,187 relevant characteristics, and then they 1774 01:20:58,187 --> 01:21:00,270 want to estimate what the Poisson distribution is. 1775 01:21:00,270 --> 01:21:03,760 And basically, they can do a chi-square test 1776 01:21:03,760 --> 01:21:06,980 to check if it's indeed a Poisson distribution. 1777 01:21:06,980 --> 01:21:07,480 All right. 1778 01:21:07,480 --> 01:21:10,070 So that will be it for today. 1779 01:21:10,070 --> 01:21:11,330 And so I'll be-- 1780 01:21:11,330 --> 01:21:13,800 I'll have your homework--
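[Since the lecture ran out of time for the Poisson example, here is a hedged sketch of what it would look like: estimate lambda, cut the infinite tail at some K, and fold the whole tail probability into one last bin, as described above. The cut point and the data are made up.]

```python
import numpy as np
from scipy.stats import poisson, chi2

x = np.random.poisson(3.0, size=300)   # hypothetical counts (e.g., accident numbers)
n = len(x)
lam_hat = x.mean()                     # MLE of lambda
K = 8                                  # cut point, chosen by the user

# PMF on {0, ..., K-1}, plus the folded tail P(X >= K) as one large bin
p_theta = np.append(poisson.pmf(np.arange(K), lam_hat),
                    poisson.sf(K - 1, lam_hat))

counts = np.bincount(np.minimum(x, K), minlength=K + 1)  # values >= K land in bin K
p_hat = counts / n

T_n = n * np.sum((p_hat - p_theta) ** 2 / p_theta)
df = (K + 1) - 1 - 1                   # K + 1 bins, minus 1, minus 1 parameter
print(T_n, chi2.sf(T_n, df))           # statistic and asymptotic p-value
```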