1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,680
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,680 --> 00:00:17,620
at ocw.mit.edu.
8
00:01:14,980 --> 00:01:17,500
PHILIPPE RIGOLLET: --bunch
of x's and a bunch of y's.
9
00:01:17,500 --> 00:01:20,140
The y's were univariate,
just one real
10
00:01:20,140 --> 00:01:21,460
valued random variable.
11
00:01:21,460 --> 00:01:24,760
And the x's were vectors that
described a bunch of attributes
12
00:01:24,760 --> 00:01:27,730
for each of our individuals
or each of our observations.
13
00:01:27,730 --> 00:01:30,350
Let's assume now that we're
given essentially only the x's.
14
00:01:30,350 --> 00:01:33,970
This is sometimes referred
to as unsupervised learning.
15
00:01:33,970 --> 00:01:35,920
There is just the x's.
16
00:01:35,920 --> 00:01:38,640
Usually, supervision
is done by the y's.
17
00:01:38,640 --> 00:01:41,710
And so what you're trying to do
is to make sense of this data.
18
00:01:41,710 --> 00:01:43,690
You're going to try to
understand this data,
19
00:01:43,690 --> 00:01:47,062
represent this data,
visualize this data,
20
00:01:47,062 --> 00:01:48,520
try to understand
something, right?
21
00:01:48,520 --> 00:01:52,196
So, if I give you a
d-dimensional random vector,
22
00:01:52,196 --> 00:01:54,070
and you're going to have
n independent copies
23
00:01:54,070 --> 00:01:57,310
of this individual-- of
this random vector, OK?
24
00:01:57,310 --> 00:01:59,530
So you will see that
I'm going to have--
25
00:01:59,530 --> 00:02:02,200
I'm going to very quickly
run into some limitations
26
00:02:02,200 --> 00:02:04,270
about what I can actually
draw on the board
27
00:02:04,270 --> 00:02:05,980
because I'm using
[? boldface ?] here.
28
00:02:05,980 --> 00:02:08,180
I'm also going to use the
blackboard [? boldface. ?]
29
00:02:08,180 --> 00:02:09,820
So it's going to
be a bit difficult.
30
00:02:09,820 --> 00:02:15,430
So tell me if you're actually
a little confused by what
31
00:02:15,430 --> 00:02:17,710
is a vector, what is a
number, and what is a matrix.
32
00:02:17,710 --> 00:02:19,720
But we'll get there.
33
00:02:19,720 --> 00:02:22,450
So I have X in Rd, and
that's a random vector.
34
00:02:26,230 --> 00:02:30,650
And I have X1 to
Xn that are IID.
35
00:02:30,650 --> 00:02:37,635
They're independent
copies of X. OK,
36
00:02:37,635 --> 00:02:40,326
so you can think
of those as being--
37
00:02:40,326 --> 00:02:41,700
the realization
of these guys are
38
00:02:41,700 --> 00:02:51,090
going to be a cloud of
n points in R to the d.
39
00:02:51,090 --> 00:02:54,210
And we're going to think
of d as being fairly large.
40
00:02:54,210 --> 00:02:55,710
And for this to
start to make sense,
41
00:02:55,710 --> 00:02:59,760
we're going to think of d
as being at least 4, OK?
42
00:02:59,760 --> 00:03:01,830
And meaning that you're
going to have a hard time
43
00:03:01,830 --> 00:03:03,480
visualizing those things.
44
00:03:03,480 --> 00:03:06,530
If it was 3 or 2, you would
be able to draw these points.
45
00:03:06,530 --> 00:03:08,040
And that's pretty
much as much sense
46
00:03:08,040 --> 00:03:09,831
you're going to be
making about those guys,
47
00:03:09,831 --> 00:03:12,030
just looking at the [INAUDIBLE]
48
00:03:12,030 --> 00:03:16,860
All right, so I'm going to
write each of those X's, right?
49
00:03:16,860 --> 00:03:20,520
So this vector, X,
has d coordinates.
50
00:03:20,520 --> 00:03:25,650
And I'm going to write
them as X1 to Xd.
51
00:03:30,730 --> 00:03:34,780
And I'm going to stack
them into a matrix, OK?
52
00:03:34,780 --> 00:03:38,100
So once I have those guys,
I'm going to have a matrix.
53
00:03:38,100 --> 00:03:40,230
But here, I'm going
to use the double bar.
54
00:03:40,230 --> 00:03:47,880
And it's X1 transpose,
Xn transpose.
55
00:03:47,880 --> 00:03:51,250
So what it means is that
the coordinates of this guy,
56
00:03:51,250 --> 00:03:53,040
of course, are X1,1.
57
00:03:53,040 --> 00:03:54,710
Here, I have--
58
00:03:54,710 --> 00:03:57,870
I'm of size d, so I have X1d.
59
00:03:57,870 --> 00:04:01,290
And here, I have Xn1.
60
00:04:01,290 --> 00:04:02,940
Xnd.
61
00:04:02,940 --> 00:04:06,660
And so the entry in the
62
00:04:06,660 --> 00:04:10,950
i-th row and j-th column
of the matrix is Xij, right?
63
00:04:10,950 --> 00:04:12,780
That is, the entry Xij.
64
00:04:23,540 --> 00:04:28,230
OK, so each-- so the rows
here are the observations.
65
00:04:28,230 --> 00:04:32,040
And the columns are the
covariates, or attributes.
66
00:04:32,040 --> 00:04:32,640
OK?
67
00:04:32,640 --> 00:04:34,060
So this is an n by d matrix.
68
00:04:39,220 --> 00:04:41,320
All right, this is really
just some bookkeeping.
69
00:04:41,320 --> 00:04:43,840
How do we store
this data somehow?
70
00:04:43,840 --> 00:04:46,257
And the fact that we use a
matrix just like for regression
71
00:04:46,257 --> 00:04:48,464
is going to be convenient
because we're going to be able
72
00:04:48,464 --> 00:04:50,050
to talk about projections--
73
00:04:50,050 --> 00:04:53,310
going to be able to talk
about things like this.
74
00:04:53,310 --> 00:04:56,310
All right, so everything
I'm going to say now
75
00:04:56,310 --> 00:04:59,190
is about variances
or covariances
76
00:04:59,190 --> 00:05:01,945
of those things, which means
that I need two moments, OK?
77
00:05:01,945 --> 00:05:03,570
If the variance does
not exist, there's
78
00:05:03,570 --> 00:05:05,320
nothing I can say
about this problem.
79
00:05:05,320 --> 00:05:07,620
So I'm going to assume
that the variance exists.
80
00:05:07,620 --> 00:05:09,090
And one way to
just put it to say
81
00:05:09,090 --> 00:05:12,390
that the two norm
of those guys is
82
00:05:12,390 --> 00:05:15,030
finite, which is another
way to say that each of them
83
00:05:15,030 --> 00:05:15,690
is finite.
84
00:05:15,690 --> 00:05:18,210
I mean, you can think
of it the way you want.
85
00:05:18,210 --> 00:05:21,000
All right, so now,
the mean of X, right?
86
00:05:21,000 --> 00:05:22,530
So I have a random vector.
87
00:05:22,530 --> 00:05:26,430
So I can talk about
the expectation of X.
88
00:05:26,430 --> 00:05:29,040
That's a vector that's in Rd.
89
00:05:29,040 --> 00:05:33,828
And that's just taking
the expectation entrywise.
90
00:05:33,828 --> 00:05:34,328
Sorry.
91
00:05:42,265 --> 00:05:45,540
X1, Xd.
92
00:05:45,540 --> 00:05:49,640
OK, so I should say it out loud.
93
00:05:49,640 --> 00:05:51,890
For this, the purpose
of this class,
94
00:05:51,890 --> 00:05:55,850
I will denote by
subscripts the indices that
95
00:05:55,850 --> 00:05:57,170
correspond to observations.
96
00:05:57,170 --> 00:06:02,690
And superscripts, the
indices that correspond to
97
00:06:02,690 --> 00:06:04,280
coordinates of a variable.
98
00:06:04,280 --> 00:06:07,340
And I think that's the
same convention that we
99
00:06:07,340 --> 00:06:10,599
took for the regression case.
100
00:06:10,599 --> 00:06:12,390
Of course, you could
use whatever you want.
101
00:06:12,390 --> 00:06:13,931
If you want to put
commas, et cetera,
102
00:06:13,931 --> 00:06:16,072
it becomes just a
bit more complicated.
103
00:06:16,072 --> 00:06:18,070
All right, and so
now, once I have this,
104
00:06:18,070 --> 00:06:21,380
so this tells me where my cloud
of point is centered, right?
105
00:06:21,380 --> 00:06:24,380
So if I have a bunch of points--
106
00:06:24,380 --> 00:06:27,440
OK, so now I have a
distribution on Rd,
107
00:06:27,440 --> 00:06:29,990
so maybe I should
talk about this--
108
00:06:29,990 --> 00:06:31,610
I'll talk about
this when we talk
109
00:06:31,610 --> 00:06:32,960
about the empirical version.
110
00:06:32,960 --> 00:06:34,460
But if you think
that you have, say,
111
00:06:34,460 --> 00:06:36,680
a two-dimensional
Gaussian random variable,
112
00:06:36,680 --> 00:06:38,930
then you have a center
in two dimension, which
113
00:06:38,930 --> 00:06:41,572
is where it peaks, basically.
114
00:06:41,572 --> 00:06:43,280
And that's what we're
talking about here.
115
00:06:43,280 --> 00:06:44,738
But the other thing
we want to know
116
00:06:44,738 --> 00:06:47,545
is how much does it spread
in every direction, right?
117
00:06:47,545 --> 00:06:49,670
So in every direction of
the two dimensional thing,
118
00:06:49,670 --> 00:06:52,220
I can then try to understand
how much spread I'm getting.
119
00:06:52,220 --> 00:06:54,900
And the way you measure this
is by using covariance, right?
120
00:06:54,900 --> 00:07:02,150
So the covariance
matrix, sigma--
121
00:07:02,150 --> 00:07:05,900
that's a matrix which is d by d.
122
00:07:05,900 --> 00:07:08,150
And it records-- in
the j, k-th entry,
123
00:07:08,150 --> 00:07:10,620
it records the covariance
between the j-th coordinate
124
00:07:10,620 --> 00:07:13,490
of X and the k-th
coordinate of X, OK?
125
00:07:13,490 --> 00:07:14,570
So with entries--
126
00:07:21,300 --> 00:07:30,510
OK, so I have sigma, which is
sigma 1,1, sigma dd, sigma 1d,
127
00:07:30,510 --> 00:07:31,175
sigma d1.
128
00:07:34,750 --> 00:07:39,690
OK, and here I have
sigma jk. And sigma jk
129
00:07:39,690 --> 00:07:48,930
is just the covariance between
Xj, the j-th coordinate
130
00:07:48,930 --> 00:07:52,160
and the k-th coordinate.
131
00:07:52,160 --> 00:07:52,869
OK?
132
00:07:52,869 --> 00:07:55,160
So in particular, it's
symmetric because the covariance
133
00:07:55,160 --> 00:07:57,780
between Xj and Xk is the same
as the covariance between Xk
134
00:07:57,780 --> 00:07:58,280
and Xj.
135
00:07:58,280 --> 00:08:01,230
I should not put those
parentheses here.
136
00:08:01,230 --> 00:08:05,330
I do not use them in this, OK?
137
00:08:05,330 --> 00:08:06,900
Just the covariance matrix.
138
00:08:06,900 --> 00:08:09,050
So that's just something
that records everything.
139
00:08:09,050 --> 00:08:10,966
And so what's nice about
the covariance matrix
140
00:08:10,966 --> 00:08:13,040
is that if I actually
give you X as a vector,
141
00:08:13,040 --> 00:08:15,170
you actually can
build the matrix just
142
00:08:15,170 --> 00:08:18,140
by looking at vectors
times vectors transpose,
143
00:08:18,140 --> 00:08:20,210
rather than actually
thinking about building
144
00:08:20,210 --> 00:08:21,882
it coordinate by coordinate.
145
00:08:21,882 --> 00:08:23,840
So for example, if you're
used to using MATLAB,
146
00:08:23,840 --> 00:08:26,006
that's the way you want to
build a covariance matrix
147
00:08:26,006 --> 00:08:29,600
because MATLAB is good
at manipulating vectors
148
00:08:29,600 --> 00:08:33,049
and matrices rather than just
entering it entry by entry.
149
00:08:33,049 --> 00:08:34,820
OK, so, right?
150
00:08:34,820 --> 00:08:42,590
So, what is the covariance
between Xj and Xk?
151
00:08:42,590 --> 00:08:51,360
Well by definition, it's
the expectation of Xj times Xk
152
00:08:51,360 --> 00:09:01,330
minus the expectation of Xj
times the expectation of Xk,
153
00:09:01,330 --> 00:09:01,830
right?
154
00:09:01,830 --> 00:09:03,496
That's the definition
of the covariance.
155
00:09:03,496 --> 00:09:05,770
I hope everybody's seeing that.
156
00:09:05,770 --> 00:09:08,280
And so, in particular,
I can actually
157
00:09:08,280 --> 00:09:10,620
see that this thing
can be written as--
158
00:09:10,620 --> 00:09:14,340
sigma can now be written
as the expectation
159
00:09:14,340 --> 00:09:21,040
of XX transpose minus
the expectation of X
160
00:09:21,040 --> 00:09:25,660
times the expectation
of X transpose.
161
00:09:25,660 --> 00:09:26,500
Why?
162
00:09:26,500 --> 00:09:29,470
Well, let's look at the jk-th
coefficient of this guy, right?
163
00:09:29,470 --> 00:09:35,650
So here, if I look at the
jk-th coefficient, I see what?
164
00:09:35,650 --> 00:09:38,980
Well, I see that
it's the expectation
165
00:09:38,980 --> 00:09:50,840
of XX transpose, its jk-th entry,
which is equal to the expectation
166
00:09:50,840 --> 00:09:53,920
of the jk-th entry of XX transpose.
167
00:09:53,920 --> 00:09:56,570
And what are the
entries of XX transpose?
168
00:09:56,570 --> 00:10:00,130
Well, they're of the
form, Xj times Xk exactly.
169
00:10:00,130 --> 00:10:02,940
So this is actually equal to
the expectation of Xj times Xk.
170
00:10:09,060 --> 00:10:11,250
And this is actually not
the way I want to write it.
171
00:10:11,250 --> 00:10:12,083
I want to write it--
172
00:10:15,530 --> 00:10:16,590
OK?
173
00:10:16,590 --> 00:10:17,590
Is that clear?
174
00:10:17,590 --> 00:10:20,420
That when I have a rank 1 matrix
of this form, XX transpose,
175
00:10:20,420 --> 00:10:21,950
the entries are of
this form, right?
176
00:10:21,950 --> 00:10:23,520
Because if I take--
177
00:10:23,520 --> 00:10:28,865
for example, think
about x, y, z, and then
178
00:10:28,865 --> 00:10:32,810
I multiply by x, y, z.
179
00:10:32,810 --> 00:10:36,380
What I'm getting here is x--
180
00:10:36,380 --> 00:10:40,350
maybe I should actually
use indices here.
181
00:10:40,350 --> 00:10:42,735
x1, x2, x3.
182
00:10:42,735 --> 00:10:44,750
x1, x2, x3.
183
00:10:44,750 --> 00:10:57,018
The entries are x1x1, x1x2,
x1x3; x2x1, x2x2, x2x3; x3x1,
184
00:10:57,018 --> 00:11:04,770
x3x2, x3x3, OK?
185
00:11:04,770 --> 00:11:08,340
So indeed, this is exactly of
the form if you look at jk,
186
00:11:08,340 --> 00:11:12,566
you get exactly Xj times Xk, OK?
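The 3-by-3 outer product written out above can be verified in a few lines; a minimal NumPy sketch (NumPy standing in for the MATLAB mentioned earlier, and the vector values are arbitrary):

```python
import numpy as np

# Outer product of a vector with itself: (x x^T)_{jk} = x_j * x_k
x = np.array([1.0, 2.0, 3.0])   # stands in for (x1, x2, x3)
xxT = np.outer(x, x)            # 3 x 3 rank-1 matrix

# Check the (j, k) entry is exactly x[j] * x[k]
for j in range(3):
    for k in range(3):
        assert xxT[j, k] == x[j] * x[k]
```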
187
00:11:12,566 --> 00:11:15,685
So that's the beauty
of those matrices.
188
00:11:15,685 --> 00:11:19,380
So now, once I have this, I
can do exactly the same thing,
189
00:11:19,380 --> 00:11:23,480
except that here, if I
take the jk-th entry,
190
00:11:23,480 --> 00:11:25,044
I will get exactly
the same thing,
191
00:11:25,044 --> 00:11:27,710
except that it's not going to be
the expectation of the product,
192
00:11:27,710 --> 00:11:29,780
but the product of the
expectation, right?
193
00:11:29,780 --> 00:11:36,810
So I get that the jk-th entry
of E of X, E of X transpose,
194
00:11:36,810 --> 00:11:48,310
is just the j-th entry of E of X
times the k-th entry of E of X.
195
00:11:48,310 --> 00:11:52,540
So if I put those two together,
it's actually telling me
196
00:11:52,540 --> 00:11:56,990
that if I look at the
j, k-th entry of sigma,
197
00:11:56,990 --> 00:11:59,690
which I called
little sigma jk, then
198
00:11:59,690 --> 00:12:01,340
this is actually equal to what?
199
00:12:01,340 --> 00:12:04,170
It's equal to the first
term minus the second term.
200
00:12:04,170 --> 00:12:11,420
The first term is the
expectation of Xj, Xk
201
00:12:11,420 --> 00:12:18,900
minus the expectation of Xj,
expectation of Xk, which--
202
00:12:18,900 --> 00:12:20,900
oh, by the way, I forgot
to say this is actually
203
00:12:20,900 --> 00:12:26,022
equal to the expectation of
Xj times the expectation of Xk
204
00:12:26,022 --> 00:12:28,230
because that's just the
definition of the expectation
205
00:12:28,230 --> 00:12:28,979
of random vectors.
206
00:12:28,979 --> 00:12:31,460
So my j and my k are now inside.
207
00:12:31,460 --> 00:12:37,175
And that's by definition the
covariance between Xj and Xk,
208
00:12:37,175 --> 00:12:39,550
OK?
209
00:12:39,550 --> 00:12:43,360
So just if you've seen those
manipulations between vectors,
210
00:12:43,360 --> 00:12:45,400
hopefully you're bored
out of your mind.
211
00:12:45,400 --> 00:12:47,800
And if you have not,
then that's something
212
00:12:47,800 --> 00:12:51,010
you just need to get
comfortable with, right?
213
00:12:51,010 --> 00:12:52,660
So one thing that's
going to be useful
214
00:12:52,660 --> 00:12:55,850
is to know very
quickly what's called
215
00:12:55,850 --> 00:12:57,850
the outer product of a
vector with itself, which
216
00:12:57,850 --> 00:12:59,997
is the vector times
the vector transpose, what
217
00:12:59,997 --> 00:13:01,330
the entries of these things are.
218
00:13:01,330 --> 00:13:06,510
And that's what we've been using
on this second set of boards.
219
00:13:06,510 --> 00:13:08,290
OK, so everybody
agrees now that we've
220
00:13:08,290 --> 00:13:11,860
sort of showed that the
covariance matrix can
221
00:13:11,860 --> 00:13:14,290
be written in this vector form.
222
00:13:14,290 --> 00:13:17,500
So expectation of XX
transpose minus expectation
223
00:13:17,500 --> 00:13:19,312
of X, expectation
of X transpose.
224
00:13:22,264 --> 00:13:28,060
OK, just like the covariance
can be written in two ways,
225
00:13:28,060 --> 00:13:30,070
right we know that the
covariance can also
226
00:13:30,070 --> 00:13:39,460
be written as the expectation
of Xj minus expectation of Xj
227
00:13:39,460 --> 00:13:45,500
times Xk minus
expectation of Xk, right?
228
00:13:45,500 --> 00:13:50,220
That's the-- sometimes, this
is the original definition
229
00:13:50,220 --> 00:13:50,850
of covariance.
230
00:13:50,850 --> 00:13:52,490
This is the second
definition of covariance.
231
00:13:52,490 --> 00:13:54,031
Just like you have
the variance which
232
00:13:54,031 --> 00:13:57,240
is the expectation of the
square of X minus E of X,
233
00:13:57,240 --> 00:14:00,390
or the expectation X squared
minus the expectation of X
234
00:14:00,390 --> 00:14:01,160
squared.
235
00:14:01,160 --> 00:14:03,420
It's the same thing
for covariance.
236
00:14:03,420 --> 00:14:11,190
And you can actually see this
in terms of vectors, right?
237
00:14:11,190 --> 00:14:14,270
So this actually implies that
you can also rewrite sigma
238
00:14:14,270 --> 00:14:21,780
as the expectation of X
minus expectation of X
239
00:14:21,780 --> 00:14:23,845
times the same thing transpose.
240
00:14:32,191 --> 00:14:32,690
Right?
241
00:14:32,690 --> 00:14:35,950
And the reason is because if
you just distribute those guys,
242
00:14:35,950 --> 00:14:43,760
this is just the
expectation of XX transpose
243
00:14:43,760 --> 00:14:54,800
minus X, expectation of X
transpose minus expectation
244
00:14:54,800 --> 00:14:59,750
of X, X transpose.
245
00:14:59,750 --> 00:15:03,608
And then I have plus
expectation of X,
246
00:15:03,608 --> 00:15:05,628
expectation of X transpose.
247
00:15:09,930 --> 00:15:13,110
Now, things could go wrong
because the main difference
248
00:15:13,110 --> 00:15:18,660
between matrices slash
vectors and numbers is
249
00:15:18,660 --> 00:15:21,930
that multiplication
does not commute, right?
250
00:15:21,930 --> 00:15:25,610
So in particular, those two
things are not the same thing.
251
00:15:25,610 --> 00:15:27,860
And so that's the main
difference that we have before,
252
00:15:27,860 --> 00:15:30,336
but it actually does not
matter for our problem.
253
00:15:30,336 --> 00:15:32,210
It's because what's
happening is that if when
254
00:15:32,210 --> 00:15:34,970
I take the expectation
of this guy, then
255
00:15:34,970 --> 00:15:38,940
it's actually the same as the
expectation of this guy, OK?
256
00:15:38,940 --> 00:15:43,540
And so just because the
expectation is linear--
257
00:15:48,230 --> 00:15:50,550
so what we have
is that sigma now
258
00:15:50,550 --> 00:15:55,560
becomes equal to the
expectation of XX transpose
259
00:15:55,560 --> 00:15:59,130
minus the expectation
of X, expectation
260
00:15:59,130 --> 00:16:03,170
of X transpose minus
expectation of X,
261
00:16:03,170 --> 00:16:07,110
expectation of X transpose.
262
00:16:07,110 --> 00:16:10,030
And then I have--
263
00:16:10,030 --> 00:16:14,070
well, really, what
I have is this guy.
264
00:16:14,070 --> 00:16:15,990
And then I have
plus the expectation
265
00:16:15,990 --> 00:16:19,680
of X, expectation
of X transpose.
266
00:16:23,970 --> 00:16:28,570
And now, those three things are
actually equal to each other
267
00:16:28,570 --> 00:16:30,700
just because the transpose of
the expectation of X
268
00:16:30,700 --> 00:16:34,145
is the same as the
expectation of X transpose.
269
00:16:34,145 --> 00:16:35,520
And so what I'm
left with is just
270
00:16:35,520 --> 00:16:44,364
the expectation of XX transpose
minus the expectation of X,
271
00:16:44,364 --> 00:16:49,650
expectation of X transpose, OK?
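The two expressions for sigma just derived, the centered form and the second-moment form, can be checked exactly on a small example; a sketch in NumPy, with a made-up discrete distribution on R^2:

```python
import numpy as np

# Check Sigma = E[XX^T] - E[X]E[X]^T = E[(X - EX)(X - EX)^T]
# on a small discrete distribution (values and probabilities are made up).
vals = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])  # possible values of X
probs = np.array([0.5, 0.3, 0.2])

mu = probs @ vals  # E[X]

# E[XX^T]: probability-weighted sum of outer products
second_moment = sum(p * np.outer(v, v) for p, v in zip(probs, vals))
sigma1 = second_moment - np.outer(mu, mu)

# Centered form E[(X - EX)(X - EX)^T]
sigma2 = sum(p * np.outer(v - mu, v - mu) for p, v in zip(probs, vals))

assert np.allclose(sigma1, sigma2)
```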
272
00:16:49,650 --> 00:16:51,610
So same thing that's
happening when
273
00:16:51,610 --> 00:16:53,110
you want to prove
that you can write
274
00:16:53,110 --> 00:16:57,760
the covariance either
this way or that way.
275
00:16:57,760 --> 00:17:00,980
The same thing happens for
matrices, or for vectors,
276
00:17:00,980 --> 00:17:02,340
right, or a covariance matrix.
277
00:17:02,340 --> 00:17:04,609
They go together.
278
00:17:04,609 --> 00:17:05,920
Is there any questions so far?
279
00:17:05,920 --> 00:17:09,460
And if you have some, please
tell me, because I want to--
280
00:17:09,460 --> 00:17:12,490
I don't know to which extent you
guys are comfortable with this
281
00:17:12,490 --> 00:17:13,420
at all or not.
282
00:17:16,700 --> 00:17:19,810
OK, so let's move on.
283
00:17:19,810 --> 00:17:23,460
All right, so of
course, this is what
284
00:17:23,460 --> 00:17:26,420
I'm describing in terms of
the distribution right here.
285
00:17:26,420 --> 00:17:28,359
I took expectations.
286
00:17:28,359 --> 00:17:30,140
Covariances are
also expectations.
287
00:17:30,140 --> 00:17:32,560
So those depend on some
distribution of X, right?
288
00:17:32,560 --> 00:17:34,630
If I wanted to compute
that, I would basically
289
00:17:34,630 --> 00:17:36,601
need to know what the
distribution of X is.
290
00:17:36,601 --> 00:17:37,975
Now, we're doing
statistics, so I
291
00:17:37,975 --> 00:17:41,180
need to [INAUDIBLE] my question
is going to be to say, well,
292
00:17:41,180 --> 00:17:44,380
how well can I estimate the
covariance matrix itself,
293
00:17:44,380 --> 00:17:47,260
or some properties of
this covariance matrix
294
00:17:47,260 --> 00:17:48,405
based on data?
295
00:17:48,405 --> 00:17:50,140
All right, so if I
want to understand
296
00:17:50,140 --> 00:17:52,990
what my covariance matrix
looks like based on data,
297
00:17:52,990 --> 00:17:54,940
I'm going to have
to basically form
298
00:17:54,940 --> 00:17:57,760
its empirical
counterparts, which
299
00:17:57,760 --> 00:18:02,200
I can do by doing the age-old
statistical trick, which
300
00:18:02,200 --> 00:18:04,700
is replace your expectation
by an average, all right?
301
00:18:04,700 --> 00:18:06,658
So let's just-- everything
that's on the board,
302
00:18:06,658 --> 00:18:09,310
you see expectation, just
replace it by an average.
303
00:18:09,310 --> 00:18:14,230
OK, so, now I'm going
to be given X1, Xn.
304
00:18:14,230 --> 00:18:16,551
So, I'm going to define
the empirical mean.
305
00:18:19,780 --> 00:18:22,290
OK so, really, the idea
is take your expectation
306
00:18:22,290 --> 00:18:24,970
and replace it by 1
over n sum, right?
307
00:18:24,970 --> 00:18:28,230
And so the empirical
mean is just 1 over n.
308
00:18:28,230 --> 00:18:31,510
Some of the Xi's--
309
00:18:31,510 --> 00:18:34,070
I'm guessing everybody knows
how to average vectors.
310
00:18:34,070 --> 00:18:36,110
It's just the average
of the coordinates.
311
00:18:36,110 --> 00:18:39,730
So I will write this as X bar.
312
00:18:39,730 --> 00:18:51,440
And the empirical covariance
matrix, often called
313
00:18:51,440 --> 00:18:57,520
sample covariance matrix,
hence the notation, S.
314
00:18:57,520 --> 00:18:59,800
Well, this is my
covariance matrix, right?
315
00:18:59,800 --> 00:19:02,650
Let's just replace the
expectations by averages.
316
00:19:02,650 --> 00:19:12,160
1 over n, sum from i equal 1 to
n, of Xi, Xi transpose, minus--
317
00:19:12,160 --> 00:19:14,290
this is the expectation
of X. I will replace it
318
00:19:14,290 --> 00:19:21,380
by the average, which I just
called X bar, X bar transpose,
319
00:19:21,380 --> 00:19:22,590
OK?
320
00:19:22,590 --> 00:19:25,480
And that's when I
want to use the--
321
00:19:25,480 --> 00:19:28,430
that's when I want
to use the notation--
322
00:19:28,430 --> 00:19:30,670
the second definition,
but I could actually
323
00:19:30,670 --> 00:19:35,530
do exactly the same thing
using this definition here.
324
00:19:35,530 --> 00:19:38,750
Sorry, using this
definition right here.
325
00:19:38,750 --> 00:19:42,340
So this is actually
1 over n, sum from i
326
00:19:42,340 --> 00:19:55,240
equal 1 to n, of Xi minus X
bar, Xi minus X bar transpose.
327
00:19:55,240 --> 00:19:56,560
And those are actually--
328
00:19:56,560 --> 00:19:58,367
I mean, in a way,
it looks like I
329
00:19:58,367 --> 00:19:59,950
could define two
different estimators,
330
00:19:59,950 --> 00:20:01,630
but you can actually check.
331
00:20:01,630 --> 00:20:03,700
And I do encourage
you to do this.
332
00:20:03,700 --> 00:20:05,920
If you're not comfortable
making those manipulations,
333
00:20:05,920 --> 00:20:08,294
you can actually check that
those two things are actually
334
00:20:08,294 --> 00:20:15,216
exactly the same, OK?
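The check encouraged here, that the two formulas for the sample covariance matrix S agree, can be done numerically; a minimal NumPy sketch with toy Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.normal(size=(n, d))   # n observations in R^d (toy data)
xbar = X.mean(axis=0)

# Form 1: (1/n) sum_i Xi Xi^T  -  xbar xbar^T
S1 = (X.T @ X) / n - np.outer(xbar, xbar)

# Form 2: (1/n) sum_i (Xi - xbar)(Xi - xbar)^T
Xc = X - xbar
S2 = (Xc.T @ Xc) / n

assert np.allclose(S1, S2)
```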
335
00:20:20,540 --> 00:20:25,070
So now, I'm going to want
to talk about matrices, OK?
336
00:20:25,070 --> 00:20:27,260
And remember, we defined
this big matrix, X,
337
00:20:27,260 --> 00:20:28,790
with the double bar.
338
00:20:28,790 --> 00:20:31,160
And the question
is, can I express
339
00:20:31,160 --> 00:20:35,360
both X bar and the
sample covariance matrix
340
00:20:35,360 --> 00:20:37,460
in terms of this big matrix, X?
341
00:20:37,460 --> 00:20:39,740
Because right now,
it's still expressed
342
00:20:39,740 --> 00:20:40,820
in terms of the vectors.
343
00:20:40,820 --> 00:20:43,220
I'm summing those vectors,
vectors transpose.
344
00:20:43,220 --> 00:20:46,050
The question is, can I just
do that in a very compact way,
345
00:20:46,050 --> 00:20:50,110
in a way that I can actually
remove this sum term,
346
00:20:50,110 --> 00:20:50,610
all right?
347
00:20:50,610 --> 00:20:52,990
That's going to be the goal.
348
00:20:52,990 --> 00:20:54,850
I mean, that's not
a notational goal.
349
00:20:54,850 --> 00:20:58,091
That's really something
that we want--
350
00:20:58,091 --> 00:20:59,590
that's going to be
convenient for us
351
00:20:59,590 --> 00:21:02,740
just like it was convenient
to talk about matrices when
352
00:21:02,740 --> 00:21:04,199
we did linear regression.
353
00:21:23,180 --> 00:21:26,340
OK, X bar.
354
00:21:26,340 --> 00:21:30,000
We just said it's 1 over
n, sum from i equal 1 to n
355
00:21:30,000 --> 00:21:32,730
of Xi, right?
356
00:21:32,730 --> 00:21:35,100
Now remember, what does
this matrix look like?
357
00:21:35,100 --> 00:21:39,010
We said that X bar--
358
00:21:39,010 --> 00:21:40,270
X is this guy.
359
00:21:40,270 --> 00:21:45,930
So if I look at X transpose,
the columns of this guy
360
00:21:45,930 --> 00:21:51,430
become X1, my first
observation, X2,
361
00:21:51,430 --> 00:21:54,840
my second observation, all the
way to Xn, my last observation,
362
00:21:54,840 --> 00:21:56,280
right?
363
00:21:56,280 --> 00:21:56,850
Agreed?
364
00:21:56,850 --> 00:21:58,470
That's what X transpose is.
365
00:21:58,470 --> 00:22:00,960
So if I want to
sum those guys, I
366
00:22:00,960 --> 00:22:02,700
can multiply by the
all-ones vector.
367
00:22:06,284 --> 00:22:08,700
All right, so that's what the
definition of the all-ones
368
00:22:08,700 --> 00:22:09,250
vector is.
369
00:22:11,840 --> 00:22:19,870
Well, it's just a bunch of
1's in Rn, in this case.
370
00:22:19,870 --> 00:22:23,620
And so when I do X transpose 1,
what I get is just the sum from
371
00:22:23,620 --> 00:22:27,690
i equal 1 to n of the Xi's.
372
00:22:27,690 --> 00:22:36,460
So if I divide by n,
I get my average, OK?
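The identity X bar = (1/n) X transpose times the all-ones vector can be sketched as follows (toy sizes, NumPy standing in for MATLAB):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.normal(size=(n, d))   # rows are observations

ones = np.ones(n)             # the all-ones vector in R^n
xbar = (X.T @ ones) / n       # (1/n) X^T 1 sums the rows, then divides by n

# Same thing as averaging the observations coordinate by coordinate
assert np.allclose(xbar, X.mean(axis=0))
```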
373
00:22:36,460 --> 00:22:43,200
So here, I definitely
removed the sum term.
374
00:22:43,200 --> 00:22:47,930
Let's see if with the covariance
matrix, we can do the same.
375
00:22:47,930 --> 00:22:53,280
Well, and that's actually a
little more difficult to see,
376
00:22:53,280 --> 00:22:55,280
I guess.
377
00:22:55,280 --> 00:23:05,510
But let's use this
definition for S, OK?
378
00:23:05,510 --> 00:23:07,540
And one thing that's
actually going to be--
379
00:23:07,540 --> 00:23:10,260
so, let's see for
one second, what--
380
00:23:10,260 --> 00:23:12,510
so it's going to be
something that involves X,
381
00:23:12,510 --> 00:23:14,377
multiplying X with itself, OK?
382
00:23:14,377 --> 00:23:15,960
And the question is,
is it going to be
383
00:23:15,960 --> 00:23:19,032
multiplying X with X transpose,
or X transpose with X?
384
00:23:19,032 --> 00:23:20,490
To answer this
question, you can go
385
00:23:20,490 --> 00:23:23,960
the easy route, which says,
well, my covariance matrix is
386
00:23:23,960 --> 00:23:24,870
of size, what?
387
00:23:24,870 --> 00:23:27,682
What is the size of S?
388
00:23:27,682 --> 00:23:28,670
AUDIENCE: d by d.
389
00:23:28,670 --> 00:23:30,260
PHILIPPE RIGOLLET: d by d, OK?
390
00:23:30,260 --> 00:23:34,200
X is of size n by d.
391
00:23:34,200 --> 00:23:35,760
So if I do X times
X transpose, I'm
392
00:23:35,760 --> 00:23:37,760
going to have something
which is of size n by n.
393
00:23:37,760 --> 00:23:39,426
If I do X transpose
X, I'm going to have
394
00:23:39,426 --> 00:23:40,794
something which is d by d.
395
00:23:40,794 --> 00:23:41,710
That's the easy route.
396
00:23:41,710 --> 00:23:44,150
And there's basically
one of the two guys.
397
00:23:44,150 --> 00:23:46,130
You can actually open
the box a little bit
398
00:23:46,130 --> 00:23:48,170
and see what's
going on in there.
399
00:23:48,170 --> 00:23:52,760
If you do X transpose X, which
we know gives you a d by d,
400
00:23:52,760 --> 00:23:54,920
you'll see that X is
going to have vectors that
401
00:23:54,920 --> 00:23:57,465
are of the form,
Xi, and X transpose
402
00:23:57,465 --> 00:24:02,230
is going to have vectors that
are of the form, Xi transpose,
403
00:24:02,230 --> 00:24:03,710
right?
404
00:24:03,710 --> 00:24:06,810
And so, this is actually
probably the right way to go.
405
00:24:06,810 --> 00:24:11,690
So let's look at what's X
transpose X is giving us.
406
00:24:11,690 --> 00:24:16,850
So I claim that it's actually
going to give us what we want,
407
00:24:16,850 --> 00:24:19,710
but rather than actually
going there, let's--
408
00:24:19,710 --> 00:24:22,700
to actually-- I mean, we
could check it entry by entry,
409
00:24:22,700 --> 00:24:25,400
but there's actually a
nice thing we can do.
410
00:24:25,400 --> 00:24:28,090
Before we go there,
let's write X transpose
411
00:24:28,090 --> 00:24:33,260
as the following sum of
variables, X1 and then
412
00:24:33,260 --> 00:24:36,270
just a bunch of 0's
everywhere else.
413
00:24:36,270 --> 00:24:39,410
So it's still d by n.
414
00:24:39,410 --> 00:24:42,470
So n minus 1 of the columns
are equal to 0 here.
415
00:24:42,470 --> 00:24:45,860
Then I'm going to put
a 0 and then put X2.
416
00:24:45,860 --> 00:24:48,550
And then just a
bunch of 0's, right?
417
00:24:48,550 --> 00:24:59,940
So that's just 0, 0 plus 0,
0, all the way to Xn, OK?
418
00:24:59,940 --> 00:25:01,260
Everybody agrees with it?
419
00:25:01,260 --> 00:25:03,650
See what I'm doing here?
420
00:25:03,650 --> 00:25:06,150
I'm just splitting it into
a sum of matrices that
421
00:25:06,150 --> 00:25:08,730
only have one nonzero column.
422
00:25:08,730 --> 00:25:11,210
But clearly, that's true.
423
00:25:11,210 --> 00:25:15,610
Now let's look at the product
of this guy with itself.
424
00:25:15,610 --> 00:25:23,396
So, let's call these
matrices M1, M2, Mn.
425
00:25:26,890 --> 00:25:30,750
So when I do X
transpose X, what I
426
00:25:30,750 --> 00:25:37,970
do is the sum of the
Mi's for i equal 1 to n,
427
00:25:37,970 --> 00:25:48,620
times the sum of the
Mi transpose, right?
428
00:25:48,620 --> 00:25:50,840
Now, the sum of
the Mi's transpose
429
00:25:50,840 --> 00:25:55,274
is just the sum of each
of the Mi's transpose, OK?
430
00:25:58,190 --> 00:26:00,620
So now I just have this
product of two sums,
431
00:26:00,620 --> 00:26:03,290
so I'm just going to
re-index the second one by j.
432
00:26:03,290 --> 00:26:12,650
So this is sum for i equal
1 to n, j equal 1 to n of Mi
433
00:26:12,650 --> 00:26:15,600
Mj transpose.
434
00:26:15,600 --> 00:26:16,100
OK?
435
00:26:19,036 --> 00:26:20,410
And now what we
want to notice is
436
00:26:20,410 --> 00:26:26,000
that if i is different
from j, what's happening?
437
00:26:26,000 --> 00:26:34,380
Well if i is different from j,
let's look at, say, M1 times M2
438
00:26:34,380 --> 00:26:35,040
transpose.
439
00:26:54,067 --> 00:26:56,150
So what is the product
between those two matrices?
440
00:27:04,404 --> 00:27:09,870
AUDIENCE: It's a new
entry and [INAUDIBLE]
441
00:27:09,870 --> 00:27:11,370
PHILIPPE RIGOLLET:
There's an entry?
442
00:27:11,370 --> 00:27:12,801
AUDIENCE: Well, it's an entry.
443
00:27:12,801 --> 00:27:17,116
It's like a dot product in that
form next to [? transpose. ?]
444
00:27:17,116 --> 00:27:19,490
PHILIPPE RIGOLLET: You mean
a dot product is just getting
445
00:27:19,490 --> 00:27:20,360
[INAUDIBLE] number, right?
446
00:27:20,360 --> 00:27:22,068
So I want-- this is
going to be a matrix.
447
00:27:22,068 --> 00:27:24,550
It's the product of
two matrices, right?
448
00:27:24,550 --> 00:27:27,100
This is a matrix times a matrix.
449
00:27:27,100 --> 00:27:31,210
So this should be a matrix,
right, of size d by d.
450
00:27:35,960 --> 00:27:37,610
Yeah, I should
see a lot of hands
451
00:27:37,610 --> 00:27:39,060
that look like this, right?
452
00:27:39,060 --> 00:27:40,200
Because look at this.
453
00:27:40,200 --> 00:27:42,450
So let's multiply the first--
454
00:27:42,450 --> 00:27:45,215
let's look at what's going
on in the first column here.
455
00:27:45,215 --> 00:27:48,840
I'm multiplying this column
with each of those rows.
456
00:27:48,840 --> 00:27:50,480
The only nonzero
coefficient is here,
457
00:27:50,480 --> 00:27:54,190
and it only hits
this column of 0's.
458
00:27:54,190 --> 00:27:57,036
So every time, this is going
to give you 0, 0, 0, 0.
459
00:27:57,036 --> 00:28:00,020
And it's going to be the same
for every single one of them.
460
00:28:00,020 --> 00:28:04,420
So this matrix is just
full of 0's, right?
461
00:28:04,420 --> 00:28:06,130
They never hit each
other when I do
462
00:28:06,130 --> 00:28:08,350
the matrix-matrix
multiplication.
463
00:28:08,350 --> 00:28:11,811
There's no-- every
non-zero hits a 0.
464
00:28:11,811 --> 00:28:13,560
So what it means is--
and this, of course,
465
00:28:13,560 --> 00:28:16,020
you can check for every
i different from j.
466
00:28:16,020 --> 00:28:22,290
So this means that Mi times
Mj transpose is actually
467
00:28:22,290 --> 00:28:27,150
equal to 0 when i is
different from j, Right?
468
00:28:27,150 --> 00:28:29,370
Everybody is OK with this?
469
00:28:29,370 --> 00:28:32,670
So what that means is that when
I do this double sum, really,
470
00:28:32,670 --> 00:28:33,670
it's a simple sum.
471
00:28:33,670 --> 00:28:37,310
There's only just the
sum from i equal 1
472
00:28:37,310 --> 00:28:41,820
to n of Mi Mi transpose.
473
00:28:41,820 --> 00:28:44,820
Because this is the only
terms in this double sum
474
00:28:44,820 --> 00:28:48,980
that are not going to be 0, when
I multiply M1 with M1
475
00:28:48,980 --> 00:28:50,492
itself.
476
00:28:50,492 --> 00:28:51,950
Now, let's see
what's going on when
477
00:28:51,950 --> 00:28:53,930
I do M1 times M1 transpose.
478
00:28:53,930 --> 00:28:57,890
Well, now, if I do Mi
times Mi transpose,
479
00:28:57,890 --> 00:29:00,300
now this guy becomes X1,
and it's here.
480
00:29:00,300 --> 00:29:03,830
And so now, I really have
X1 times X1 transpose.
481
00:29:03,830 --> 00:29:06,785
So this is really
just the sum from i
482
00:29:06,785 --> 00:29:20,080
equal 1 to n of Xi Xi transpose,
just because Mi Mi transpose
483
00:29:20,080 --> 00:29:21,716
is Xi Xi transpose.
484
00:29:21,716 --> 00:29:22,840
There's nothing else there.
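This block-matrix identity is easy to sanity-check numerically. Below is a minimal NumPy sketch; the sizes n, d and the random data are purely illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))  # n observations as rows of X, so X^T is d by n

# Sum of the rank-one matrices X_i X_i^T, one per observation (the M_i M_i^T terms)
outer_sum = sum(np.outer(X[i], X[i]) for i in range(n))

# The cross terms M_i M_j^T vanish, so the double sum collapses to X^T X
assert np.allclose(X.T @ X, outer_sum)
```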
485
00:29:26,190 --> 00:29:28,520
So that's the good news, right?
486
00:29:28,520 --> 00:29:37,100
This term here is really just
X transpose X divided by n.
487
00:29:43,460 --> 00:29:45,740
OK, I can use that
guy again, I guess.
488
00:29:45,740 --> 00:29:46,260
Well, no.
489
00:29:46,260 --> 00:30:08,602
Let's just-- OK, so
let me rewrite S.
490
00:30:08,602 --> 00:30:10,310
All right, that's the
definition we have.
491
00:30:10,310 --> 00:30:14,990
And we know that this guy
already is equal to 1 over n X
492
00:30:14,990 --> 00:30:20,960
transpose X. x bar
x bar transpose--
493
00:30:20,960 --> 00:30:25,950
we know that x bar-- we
just proved that x bar--
494
00:30:25,950 --> 00:30:31,080
sorry, little x
bar was equal to 1
495
00:30:31,080 --> 00:30:36,652
over n X transpose
times the all-ones vector.
496
00:30:36,652 --> 00:30:37,860
So I'm just going to do that.
497
00:30:37,860 --> 00:30:39,340
So that's just
going to be minus.
498
00:30:39,340 --> 00:30:40,999
I'm going to pull
my two 1 over n's--
499
00:30:40,999 --> 00:30:42,540
one from this guy,
one from this guy.
500
00:30:42,540 --> 00:30:44,530
So I'm going to get
1 over n squared.
501
00:30:44,530 --> 00:30:47,070
And then I'm going
to get X bar--
502
00:30:47,070 --> 00:30:48,690
sorry, there's no X bar here.
503
00:30:48,690 --> 00:30:50,908
It's just X. Yeah.
504
00:30:50,908 --> 00:30:59,861
X transpose all ones times X
transpose all ones transpose,
505
00:30:59,861 --> 00:31:00,360
right?
506
00:31:04,580 --> 00:31:07,580
And X transpose all
ones transpose--
507
00:31:11,800 --> 00:31:14,200
right, the rule-- if I
have A times B transpose,
508
00:31:14,200 --> 00:31:16,180
it's B transpose times
A transpose, right?
509
00:31:23,460 --> 00:31:25,060
That's just the rule
of transposition.
510
00:31:25,060 --> 00:31:31,400
So this is 1
transpose X transpose.
511
00:31:31,400 --> 00:31:34,120
And so when I put all
these guys together,
512
00:31:34,120 --> 00:31:38,365
this is actually equal to 1
over n X transpose X minus one
513
00:31:38,365 --> 00:31:47,670
over n squared X transpose
1, 1 transpose X. Because X
514
00:31:47,670 --> 00:31:50,466
transpose, transposed, is X, OK?
515
00:31:53,700 --> 00:31:55,950
So now, I can actually--
516
00:31:55,950 --> 00:31:59,435
I have something which is
of the form, X transpose X--
517
00:31:59,435 --> 00:32:01,800
[INAUDIBLE] to the left, X
transpose; to the right, X.
518
00:32:01,800 --> 00:32:04,930
Here, I have X transpose to
the left, X to the right.
519
00:32:04,930 --> 00:32:07,690
So it can factor out
whatever's in there.
520
00:32:07,690 --> 00:32:11,640
So I can write S as 1 over n--
521
00:32:11,640 --> 00:32:17,230
sorry, X transpose times 1 over
n times the identity of Rd.
522
00:32:21,610 --> 00:32:33,110
And then I have minus 1
over n, 1, 1 transpose X.
523
00:32:33,110 --> 00:32:34,490
OK, because if you--
524
00:32:34,490 --> 00:32:36,770
I mean, you can
distribute it back, right?
525
00:32:36,770 --> 00:32:38,090
So here, I'm going to get what?
526
00:32:38,090 --> 00:32:41,810
X transpose identity times X,
the whole thing divided by n.
527
00:32:41,810 --> 00:32:42,777
That's this term.
528
00:32:42,777 --> 00:32:45,110
And then the second one is
going to be-- sorry, 1 over n
529
00:32:45,110 --> 00:32:46,110
squared.
530
00:32:46,110 --> 00:32:50,840
And then I'm going to get 1 over
n squared times X transpose 1,
531
00:32:50,840 --> 00:32:53,990
1 transpose which is
this guy, times X,
532
00:32:53,990 --> 00:32:58,580
and that's the [? right ?]
[? thing, ?] OK?
533
00:32:58,580 --> 00:33:01,820
So, the way it's written, I
factored out one of the 1 over
534
00:33:01,820 --> 00:33:02,320
n's.
535
00:33:02,320 --> 00:33:05,500
So I'm just going to do the
same thing as on this slide.
536
00:33:05,500 --> 00:33:08,110
So I'm just factoring
out this 1 over n here.
537
00:33:08,110 --> 00:33:16,280
So it's 1 over n times
X transpose identity
538
00:33:16,280 --> 00:33:21,010
of Rd-- divided by 1
this time, not by n--
539
00:33:21,010 --> 00:33:26,780
minus 1 over n 1, 1
transpose times X, OK?
540
00:33:26,780 --> 00:33:28,395
So that's just
what's on the slides.
541
00:33:31,720 --> 00:33:35,874
What does the matrix, 1,
1 transpose, look like?
542
00:33:35,874 --> 00:33:36,790
AUDIENCE: All 1's.
543
00:33:36,790 --> 00:33:38,623
PHILIPPE RIGOLLET: It's
just all 1's, right?
544
00:33:38,623 --> 00:33:41,060
Because the entries are the
products of the all-ones--
545
00:33:41,060 --> 00:33:42,750
of the coordinates of
the all-ones vectors with
546
00:33:42,750 --> 00:33:45,208
the coordinates of the all-ones
vectors, so I only get 1's.
547
00:33:45,208 --> 00:33:49,610
So it's a d by d
matrix with only 1's.
548
00:33:49,610 --> 00:33:52,170
So this matrix, I can
actually write exactly, right?
549
00:33:52,170 --> 00:33:55,710
H, this matrix that
I called H which
550
00:33:55,710 --> 00:33:59,430
is what's sandwiched in-between
this X transpose and X.
551
00:33:59,430 --> 00:34:02,760
By definition, I said this
is the definition of H. Then
552
00:34:02,760 --> 00:34:06,060
this thing, I can write
its coordinates exactly.
553
00:34:18,880 --> 00:34:23,110
We know it's identity
divided by n minus--
554
00:34:23,110 --> 00:34:25,330
sorry, I don't know
why I keep [INAUDIBLE]..
555
00:34:25,330 --> 00:34:29,110
Minus 1 over n 1, 1 transpose--
556
00:34:29,110 --> 00:34:30,940
so it's this matrix
with the only 1's
557
00:34:30,940 --> 00:34:34,389
on the diagonal and 0's
elsewhere-- minus a matrix that
558
00:34:34,389 --> 00:34:36,487
only has 1 over n everywhere.
559
00:34:41,469 --> 00:34:49,820
OK, so the whole thing is 1
minus 1 over n on the diagonals
560
00:34:49,820 --> 00:34:57,430
and then minus 1
over n here, OK?
561
00:34:57,430 --> 00:35:01,920
And now I claim that this matrix
is an orthogonal projector.
562
00:35:01,920 --> 00:35:05,580
Now, I'm writing this, but
it's completely useless.
563
00:35:05,580 --> 00:35:08,190
This is just a way for you to
see that it's actually very
564
00:35:08,190 --> 00:35:11,430
convenient now to think
about this problem
565
00:35:11,430 --> 00:35:14,850
as being a matrix
problem, because things
566
00:35:14,850 --> 00:35:17,890
are much nicer when you
think about the actual form
567
00:35:17,890 --> 00:35:18,890
of your matrices, right?
568
00:35:18,890 --> 00:35:21,090
They could tell you,
here is the matrix.
569
00:35:21,090 --> 00:35:23,340
I mean, imagine you're
sitting at a midterm,
570
00:35:23,340 --> 00:35:25,910
and I say, here's the
matrix that has 1 minus 1
571
00:35:25,910 --> 00:35:28,640
over n on the diagonals
and minus 1 over n
572
00:35:28,640 --> 00:35:30,010
on the off-diagonal.
573
00:35:30,010 --> 00:35:32,855
Prove to me that it's
a projector matrix.
574
00:35:32,855 --> 00:35:34,230
You're going to
have to basically
575
00:35:34,230 --> 00:35:35,520
take this guy times itself.
576
00:35:35,520 --> 00:35:37,497
It's going to be really
complicated, right?
577
00:35:37,497 --> 00:35:38,580
So we know it's symmetric.
578
00:35:38,580 --> 00:35:39,930
That's for sure.
579
00:35:39,930 --> 00:35:42,120
But the fact that it
has this particular way
580
00:35:42,120 --> 00:35:44,100
of writing it is
going to make my life
581
00:35:44,100 --> 00:35:45,599
super easy to check this.
582
00:35:45,599 --> 00:35:47,140
That's the definition
of a projector.
583
00:35:47,140 --> 00:35:48,930
It has to be
symmetric and it has
584
00:35:48,930 --> 00:35:51,270
to square to itself
because we just
585
00:35:51,270 --> 00:35:54,300
said in the chapter
on linear regression
586
00:35:54,300 --> 00:35:57,360
that once you project, if you
apply the projection again,
587
00:35:57,360 --> 00:35:59,610
you're not moving because
you're already there.
588
00:35:59,610 --> 00:36:04,469
OK, so why is H
squared equal to H?
589
00:36:04,469 --> 00:36:05,760
Well, let's just write H squared.
590
00:36:05,760 --> 00:36:09,300
It's the identity
minus 1 over n 1, 1
591
00:36:09,300 --> 00:36:16,610
transpose times the
identity minus 1 over n 1, 1
592
00:36:16,610 --> 00:36:19,370
transpose, right?
593
00:36:19,370 --> 00:36:22,490
Let's just expand this now.
594
00:36:22,490 --> 00:36:25,350
This is equal to
the identity minus--
595
00:36:25,350 --> 00:36:29,280
well, the identity times 1, 1
transpose is just the identity.
596
00:36:29,280 --> 00:36:31,900
So it's 1, 1 transpose, sorry.
597
00:36:31,900 --> 00:36:38,840
So 1 over n 1, 1 transpose
minus 1 over n 1, 1 transpose.
598
00:36:38,840 --> 00:36:40,400
And then there's
going to be what
599
00:36:40,400 --> 00:36:42,710
makes the deal is that
I get this 1 over n
600
00:36:42,710 --> 00:36:44,750
squared this time.
601
00:36:44,750 --> 00:36:46,950
And then I get the product
of 1 over n trans--
602
00:36:46,950 --> 00:36:48,200
oh, let's write it completely.
603
00:36:48,200 --> 00:36:58,010
I get 1, 1 transpose
times 1, 1 transpose, OK?
604
00:36:58,010 --> 00:37:01,260
But this thing here--
605
00:37:01,260 --> 00:37:03,840
what is this?
606
00:37:03,840 --> 00:37:06,359
n, right, is the inner product
of the all-ones vector
607
00:37:06,359 --> 00:37:07,400
with the all-ones vector.
608
00:37:07,400 --> 00:37:10,740
So I'm just summing n times
1 squared, which is n.
609
00:37:10,740 --> 00:37:11,980
So this is equal to n.
610
00:37:11,980 --> 00:37:13,920
So I pull it out,
cancel one of the n's,
611
00:37:13,920 --> 00:37:15,870
and I'm back to
what I had before.
612
00:37:15,870 --> 00:37:21,720
So I had identity minus 2
over n 1, 1 transpose plus 1
613
00:37:21,720 --> 00:37:27,530
over n 1, 1 transpose
which is equal to H.
614
00:37:27,530 --> 00:37:30,700
Because one of the 1
over n's cancel, OK?
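The projector claim just proved can be checked in a few lines of NumPy; the size n here is an arbitrary illustrative choice:

```python
import numpy as np

n = 6
one = np.ones((n, 1))
H = np.eye(n) - (one @ one.T) / n  # H = I - (1/n) 1 1^T

# A projection matrix is symmetric and squares to itself
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
```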
615
00:37:36,264 --> 00:37:37,430
So it's a projection matrix.
616
00:37:37,430 --> 00:37:41,030
It's projecting onto
some linear space, right?
617
00:37:41,030 --> 00:37:42,450
It's taking a matrix.
618
00:37:42,450 --> 00:37:44,480
Sorry, it's taking
a vector and it's
619
00:37:44,480 --> 00:37:46,535
projecting onto a
certain space of vectors.
620
00:37:49,255 --> 00:37:50,229
What is this space?
621
00:37:53,160 --> 00:37:54,920
Right, so, how do
you-- so I'm only
622
00:37:54,920 --> 00:37:57,500
asking the answer to this
question in words, right?
623
00:37:57,500 --> 00:37:59,830
So how would you
describe the vectors
624
00:37:59,830 --> 00:38:02,950
onto which this
matrix is projecting?
625
00:38:02,950 --> 00:38:05,050
Well, if you want to
answer this question,
626
00:38:05,050 --> 00:38:07,870
the way you would tackle
it is first by saying, OK,
627
00:38:07,870 --> 00:38:13,690
what does a vector which is of
the form, H times something,
628
00:38:13,690 --> 00:38:14,960
look like, right?
629
00:38:14,960 --> 00:38:16,870
What can I say about
this vector that's
630
00:38:16,870 --> 00:38:19,540
going to be definitely
giving me something
631
00:38:19,540 --> 00:38:21,760
about the space on
which it projects?
632
00:38:21,760 --> 00:38:24,800
I need to know a little more to
know that it projects exactly
633
00:38:24,800 --> 00:38:25,820
onto this.
634
00:38:25,820 --> 00:38:29,050
But one way we can
do this is just
635
00:38:29,050 --> 00:38:30,440
see how it acts on a vector.
636
00:38:30,440 --> 00:38:32,370
What does it do to a
vector to apply H, right?
637
00:38:32,370 --> 00:38:44,550
So I take v. And let's see what
taking v and applying H to it
638
00:38:44,550 --> 00:38:46,410
looks like.
639
00:38:46,410 --> 00:38:48,750
Well, it's the identity
minus something.
640
00:38:48,750 --> 00:38:50,640
So it takes v and
it removes something
641
00:38:50,640 --> 00:38:54,160
from v. What does it remove?
642
00:38:54,160 --> 00:39:00,590
Well, it's 1 over n
times v transpose 1 times
643
00:39:00,590 --> 00:39:03,861
the all-ones vector, right?
644
00:39:03,861 --> 00:39:04,360
Agreed?
645
00:39:04,360 --> 00:39:13,570
I just wrote v transpose 1
instead of 1 transpose v,
646
00:39:13,570 --> 00:39:16,250
which are the same thing.
647
00:39:16,250 --> 00:39:17,310
What is this thing?
648
00:39:25,160 --> 00:39:27,765
What should I call it in
mathematical notation?
649
00:39:30,720 --> 00:39:31,460
v bar, right?
650
00:39:31,460 --> 00:39:35,150
I should call it v bar because
this is exactly the average
651
00:39:35,150 --> 00:39:38,840
of the entries of v, agreed?
652
00:39:38,840 --> 00:39:41,560
This is summing the entries
of v's, and this is dividing
653
00:39:41,560 --> 00:39:43,170
by the number of those v's.
654
00:39:43,170 --> 00:39:44,860
Sorry, now v is in R--
655
00:39:49,162 --> 00:39:51,074
sorry, why do I divide by--
656
00:39:53,950 --> 00:39:59,070
I'm just-- OK, I need to check
what my dimensions are now.
657
00:39:59,070 --> 00:40:00,390
No, it's in Rd, right?
658
00:40:00,390 --> 00:40:02,660
So why do I divide by n?
659
00:40:05,520 --> 00:40:07,720
So it's not really v bar.
660
00:40:07,720 --> 00:40:13,910
It's the sum of the
v's divided by--
661
00:40:13,910 --> 00:40:14,870
right, so it's v bar.
662
00:40:24,024 --> 00:40:25,163
AUDIENCE: [INAUDIBLE]
663
00:40:25,163 --> 00:40:25,996
[INTERPOSING VOICES]
664
00:40:25,996 --> 00:40:27,968
AUDIENCE: Yeah, v
has to be [INAUDIBLE]
665
00:40:27,968 --> 00:40:29,450
PHILIPPE RIGOLLET: Oh, yeah.
666
00:40:29,450 --> 00:40:31,120
OK, thank you.
667
00:40:31,120 --> 00:40:34,750
So everywhere I wrote
Hd, that was actually Hn.
668
00:40:34,750 --> 00:40:35,290
Oh, man.
669
00:40:35,290 --> 00:40:37,220
I wish I had a computer now.
670
00:40:37,220 --> 00:40:37,720
All right.
671
00:40:37,720 --> 00:40:43,230
So-- yeah, because the--
672
00:40:43,230 --> 00:40:43,740
yeah, right?
673
00:40:43,740 --> 00:40:45,775
So why it's not--
674
00:40:45,775 --> 00:40:48,150
well, why I thought it was
this is because I was thinking
675
00:40:48,150 --> 00:40:49,890
about the outer
dimension of X, really
676
00:40:49,890 --> 00:40:51,780
of X transpose, which is
really the inner dimension,
677
00:40:51,780 --> 00:40:52,914
didn't matter to me, right?
678
00:40:52,914 --> 00:40:55,080
So the thing that I can
sandwich between X transpose
679
00:40:55,080 --> 00:40:56,790
and X has to be n by n.
680
00:40:56,790 --> 00:40:58,800
So this was actually n by n.
681
00:40:58,800 --> 00:41:00,480
And so that's actually n by n.
682
00:41:00,480 --> 00:41:03,330
Everything is n by n.
683
00:41:03,330 --> 00:41:04,308
Sorry about that.
684
00:41:08,220 --> 00:41:09,400
So this is n.
685
00:41:09,400 --> 00:41:10,440
This is n.
686
00:41:10,440 --> 00:41:12,130
This is-- well, I
didn't really tell you
687
00:41:12,130 --> 00:41:16,290
what the all-ones vector
was, but it's also in Rn.
688
00:41:16,290 --> 00:41:18,430
Yeah, OK.
689
00:41:22,190 --> 00:41:23,730
Thank you.
690
00:41:23,730 --> 00:41:27,939
And n-- actually, I used the
fact that this was of size n
691
00:41:27,939 --> 00:41:28,480
here already.
692
00:41:31,690 --> 00:41:33,340
OK, and so that's indeed v bar.
693
00:41:38,996 --> 00:41:40,870
So what is this projection
doing to a vector?
694
00:41:47,470 --> 00:41:51,930
It's removing its average
on each coordinate, right?
695
00:41:51,930 --> 00:41:54,570
And the effect of this
is that v is a vector.
696
00:41:54,570 --> 00:41:58,355
What is the average of Hv?
697
00:41:58,355 --> 00:41:59,340
AUDIENCE: 0.
698
00:41:59,340 --> 00:42:00,840
PHILIPPE RIGOLLET:
Right, so it's 0.
699
00:42:00,840 --> 00:42:04,050
It's the average of v, which
is v bar, minus the average
700
00:42:04,050 --> 00:42:07,230
of something that only has v
bar's entry, which is v bar.
701
00:42:07,230 --> 00:42:08,490
So this thing is actually 0.
702
00:42:11,560 --> 00:42:12,840
So let me repeat my question.
703
00:42:12,840 --> 00:42:15,700
Onto what subspace
does H project?
704
00:42:22,700 --> 00:42:26,670
Onto the subspace of
vectors that have mean 0.
705
00:42:26,670 --> 00:42:30,010
A vector that has
mean 0 is a vector.
706
00:42:30,010 --> 00:42:34,970
So if you want to talk more
linear algebra, v bar--
707
00:42:34,970 --> 00:42:36,750
for a vector you
have mean 0, it means
708
00:42:36,750 --> 00:42:43,440
that v is orthogonal to the
span of the all-ones vector.
709
00:42:43,440 --> 00:42:44,280
That's it.
710
00:42:44,280 --> 00:42:46,080
It projects to this space.
711
00:42:46,080 --> 00:42:47,930
So in words, it
projects onto the space
712
00:42:47,930 --> 00:42:49,880
of vectors that have 0 mean.
713
00:42:49,880 --> 00:42:52,380
In linear algebra,
it says it projects
714
00:42:52,380 --> 00:42:55,760
onto the hyperplane
which is orthogonal
715
00:42:55,760 --> 00:42:58,360
to the all-ones vector, OK?
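To see this action concretely, here is a small NumPy sketch checking that H subtracts the mean from each coordinate and that mean-zero vectors are fixed points; the vector v is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
one = np.ones(n)
H = np.eye(n) - np.outer(one, one) / n

v = rng.standard_normal(n)
Hv = H @ v

# H removes the average v bar from every coordinate of v
assert np.allclose(Hv, v - v.mean())
# ...so Hv has mean 0: it lies in the hyperplane orthogonal to the all-ones vector
assert np.isclose(Hv.mean(), 0.0)
# A vector already in that hyperplane is left unchanged (projecting twice does nothing)
assert np.allclose(H @ Hv, Hv)
```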
716
00:42:58,360 --> 00:43:01,860
So that's all.
717
00:43:01,860 --> 00:43:04,760
Can you guys still
see the screen?
718
00:43:04,760 --> 00:43:05,940
Are you good over there?
719
00:43:05,940 --> 00:43:07,420
OK.
720
00:43:07,420 --> 00:43:12,030
All right, so now, what it
means is that, well, I'm
721
00:43:12,030 --> 00:43:13,280
doing this weird thing, right?
722
00:43:13,280 --> 00:43:15,360
I'm taking the inner product--
723
00:43:15,360 --> 00:43:20,030
so S is taking X. And then
it's removing its mean of each
724
00:43:20,030 --> 00:43:21,440
of the columns of X, right?
725
00:43:21,440 --> 00:43:24,530
When I take H times X, I'm
basically applying this
726
00:43:24,530 --> 00:43:26,780
projection which consists
in removing the mean of all
727
00:43:26,780 --> 00:43:28,430
the X's.
728
00:43:28,430 --> 00:43:31,340
And then I multiply
by H transpose.
729
00:43:31,340 --> 00:43:33,550
But what's actually
nice is that, remember,
730
00:43:33,550 --> 00:43:35,930
H is a projector.
731
00:43:35,930 --> 00:43:38,000
Sorry, I don't
want to keep that.
732
00:43:38,000 --> 00:43:47,010
Which means that when I
look at X transpose HX,
733
00:43:47,010 --> 00:43:52,410
it's the same as looking
at X transpose H squared X.
734
00:43:52,410 --> 00:43:54,420
But since H is equal
to its transpose,
735
00:43:54,420 --> 00:43:58,020
this is actually the same
as looking at X transpose H
736
00:43:58,020 --> 00:44:07,146
transpose HX, which is the
same as looking at HX transpose
737
00:44:07,146 --> 00:44:11,000
HX, OK?
738
00:44:11,000 --> 00:44:14,300
So what it's doing, it's
first applying this projection
739
00:44:14,300 --> 00:44:18,950
matrix, H, which removes the
mean of each of your columns,
740
00:44:18,950 --> 00:44:23,000
and then looks at the inner
products between those guys,
741
00:44:23,000 --> 00:44:23,586
right?
742
00:44:23,586 --> 00:44:25,460
Each entry of this guy
is just the covariance
743
00:44:25,460 --> 00:44:27,320
between those centered things.
744
00:44:27,320 --> 00:44:28,910
That's all it's doing.
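Here is a short NumPy check that the definition of S and the sandwich form (1/n) X^T H X, equivalently (1/n)(HX)^T(HX), agree; the dimensions and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
X = rng.standard_normal((n, d))
one = np.ones((n, 1))
H = np.eye(n) - (one @ one.T) / n

# S from the definition: (1/n) sum X_i X_i^T  minus  x bar x bar^T
xbar = X.mean(axis=0)
S_def = (X.T @ X) / n - np.outer(xbar, xbar)

# S from the sandwich form, and from centering the columns first with H
S_sandwich = (X.T @ H @ X) / n
Xc = H @ X  # each column of X with its mean removed
assert np.allclose(S_def, S_sandwich)
assert np.allclose(S_sandwich, (Xc.T @ Xc) / n)
```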
745
00:44:28,910 --> 00:44:35,450
All right, so those are actually
going to be the key statements.
746
00:44:35,450 --> 00:44:37,270
So everything we've
done so far is really
747
00:44:37,270 --> 00:44:38,920
mainly linear algebra, right?
748
00:44:38,920 --> 00:44:41,950
I mean, looking at expectations
and covariances was just--
749
00:44:41,950 --> 00:44:44,200
we just used the fact that
the expectation was linear.
750
00:44:44,200 --> 00:44:45,520
We didn't do much.
751
00:44:45,520 --> 00:44:47,450
But now there's a nice
thing that's happening.
752
00:44:47,450 --> 00:44:50,050
And that's why we're
going to switch
753
00:44:50,050 --> 00:44:51,550
from the language
of linear algebra
754
00:44:51,550 --> 00:44:53,710
to more statistical,
because what's happening
755
00:44:53,710 --> 00:44:57,010
is that if I look at this
quadratic form, right?
756
00:44:57,010 --> 00:44:59,080
So I take sigma.
757
00:44:59,080 --> 00:45:00,462
So I take a vector, u.
758
00:45:03,630 --> 00:45:09,180
And I'm going to look at
u-- so let's say, in Rd.
759
00:45:09,180 --> 00:45:14,796
And I'm going to look
at u transpose sigma u.
760
00:45:14,796 --> 00:45:15,295
OK?
761
00:45:18,510 --> 00:45:19,720
What is this doing?
762
00:45:19,720 --> 00:45:24,630
Well, we know that u transpose
sigma u is equal to what?
763
00:45:24,630 --> 00:45:31,720
Well, sigma is the
expectation of XX transpose
764
00:45:31,720 --> 00:45:35,610
minus the expectation of X
expectation of X transpose,
765
00:45:35,610 --> 00:45:36,110
right?
766
00:45:39,460 --> 00:45:40,948
So I just substitute in there.
767
00:45:46,100 --> 00:45:49,370
Now, u is deterministic.
768
00:45:49,370 --> 00:45:52,250
So in particular, I can push
it inside the expectation
769
00:45:52,250 --> 00:45:55,180
here, agreed?
770
00:45:55,180 --> 00:45:57,200
And I can do the
same from the right.
771
00:45:57,200 --> 00:46:00,800
So here, when I push u
transpose here, and u here,
772
00:46:00,800 --> 00:46:06,170
what I'm left with is the
expectation of u transpose X
773
00:46:06,170 --> 00:46:09,990
times X transpose u.
774
00:46:09,990 --> 00:46:11,436
OK?
775
00:46:11,436 --> 00:46:14,050
And now, I can do the
same thing for this guy.
776
00:46:14,050 --> 00:46:17,410
And this tells me that this is
the expectation of u transpose
777
00:46:17,410 --> 00:46:21,340
X times the expectation
of X transpose u.
778
00:46:24,640 --> 00:46:29,260
Of course, u transpose X
is equal to X transpose u.
779
00:46:29,260 --> 00:46:31,330
And u-- yeah.
780
00:46:31,330 --> 00:46:33,910
So what it means is
that this is actually
781
00:46:33,910 --> 00:46:43,700
equal to the expectation
of u transpose X squared
782
00:46:43,700 --> 00:46:48,020
minus the expectation
of u transpose X,
783
00:46:48,020 --> 00:46:49,065
the whole thing squared.
784
00:46:56,900 --> 00:46:58,900
But this is something
that should look familiar.
785
00:46:58,900 --> 00:47:01,316
This is really just the variance
of this particular random
786
00:47:01,316 --> 00:47:03,360
variable which is of
the form, u transpose X,
787
00:47:03,360 --> 00:47:06,900
right? u transpose
X is a number.
788
00:47:06,900 --> 00:47:10,110
It involves a random vector,
so it's a random variable.
789
00:47:10,110 --> 00:47:11,580
And so it has a variance.
790
00:47:11,580 --> 00:47:15,430
And this variance is exactly
given by this formula.
791
00:47:15,430 --> 00:47:19,595
So this is just the
variance of u transpose X.
792
00:47:19,595 --> 00:47:21,720
So what we've proved is
that if I look at this guy,
793
00:47:21,720 --> 00:47:29,772
this is really just the
variance of u transpose X, OK?
794
00:47:37,580 --> 00:47:40,930
I can do the same thing
for the sample variance.
795
00:47:40,930 --> 00:47:41,770
So let's do this.
796
00:47:48,240 --> 00:47:52,140
And as you can
see, spoiler alert,
797
00:47:52,140 --> 00:47:56,334
this is going to be
the sample variance.
798
00:47:59,590 --> 00:48:09,430
OK, so remember, S is 1 over n,
sum of Xi Xi transpose minus X
799
00:48:09,430 --> 00:48:12,100
bar X bar transpose.
800
00:48:12,100 --> 00:48:16,060
So when I do u
transpose, Su, what
801
00:48:16,060 --> 00:48:19,400
it gives me is 1 over
n sum from i equal 1
802
00:48:19,400 --> 00:48:25,780
to n of u transpose Xi times
Xi transpose u, all right?
803
00:48:25,780 --> 00:48:27,880
So those are two numbers
that multiply each other
804
00:48:27,880 --> 00:48:30,370
and that happen to be
equal to each other,
805
00:48:30,370 --> 00:48:36,430
minus u transpose X
bar X bar transpose u,
806
00:48:36,430 --> 00:48:38,770
which is also the product
of two numbers that happen
807
00:48:38,770 --> 00:48:39,997
to be equal to each other.
808
00:48:39,997 --> 00:48:41,455
So I can rewrite
this with squares.
809
00:48:55,120 --> 00:48:57,390
So we're almost there.
810
00:48:57,390 --> 00:49:00,360
All I need to know to check
is that this thing is actually
811
00:49:00,360 --> 00:49:02,010
the average of
those guys, right?
812
00:49:02,010 --> 00:49:04,530
So u transpose X bar.
813
00:49:04,530 --> 00:49:05,030
What is it?
814
00:49:05,030 --> 00:49:10,980
It's 1 over n sum from i equal
1 to n of u transpose Xi.
815
00:49:10,980 --> 00:49:17,050
So it's really something that I
can write as u transpose X bar,
816
00:49:17,050 --> 00:49:17,550
right?
817
00:49:17,550 --> 00:49:19,383
That's the average of
those random variables
818
00:49:19,383 --> 00:49:21,240
of the form, u transpose Xi.
819
00:49:23,880 --> 00:49:29,910
So what it means is that u
transpose Su, I can write as 1
820
00:49:29,910 --> 00:49:38,060
over n sum from i equal 1 to
n of u transpose Xi squared
821
00:49:38,060 --> 00:49:46,720
minus u transpose X
bar squared, which
822
00:49:46,720 --> 00:49:51,660
is the empirical variance
that we denoted by small
823
00:49:51,660 --> 00:49:54,600
s squared, right?
824
00:49:54,600 --> 00:50:06,850
So that's the empirical variance
of u transpose X1 all the way
825
00:50:06,850 --> 00:50:08,209
to u transpose Xn.
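This empirical identity can also be verified numerically; u and the data below are arbitrary illustrations, and the variance uses the 1/n convention so that it matches S:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4
X = rng.standard_normal((n, d))
u = rng.standard_normal(d)

# Sample covariance matrix S with the 1/n convention
xbar = X.mean(axis=0)
S = (X.T @ X) / n - np.outer(xbar, xbar)

proj = X @ u  # the n numbers u^T X_1, ..., u^T X_n
# u^T S u is exactly the empirical variance of the projections
assert np.isclose(u @ S @ u, proj.var())  # .var() defaults to ddof=0, the 1/n convention
```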
826
00:50:12,430 --> 00:50:13,910
OK, and here, same thing.
827
00:50:13,910 --> 00:50:15,210
I use exactly the same thing.
828
00:50:15,210 --> 00:50:17,990
I just use the fact that here,
the only thing I use is really
829
00:50:17,990 --> 00:50:20,790
the linearity of this
guy, of 1 over n sum
830
00:50:20,790 --> 00:50:24,020
or the linearity of expectation,
that I can push things
831
00:50:24,020 --> 00:50:26,740
in there, OK?
832
00:50:30,224 --> 00:50:31,640
AUDIENCE: So what
you have written
833
00:50:31,640 --> 00:50:33,844
at the end of that
sum for uT Su?
834
00:50:33,844 --> 00:50:35,010
PHILIPPE RIGOLLET: This one?
835
00:50:35,010 --> 00:50:35,380
AUDIENCE: Yeah.
836
00:50:35,380 --> 00:50:37,290
PHILIPPE RIGOLLET: Yeah, I
said it's equal to small s,
837
00:50:37,290 --> 00:50:39,430
and I want to make a
difference between the big S
838
00:50:39,430 --> 00:50:40,660
that I'm using here.
839
00:50:40,660 --> 00:50:42,650
So this is equal to small--
840
00:50:42,650 --> 00:50:45,190
I don't know, I'm
trying to make it look
841
00:50:45,190 --> 00:50:47,550
like a calligraphic s squared.
842
00:50:56,870 --> 00:51:00,040
OK, so this is nice, right?
843
00:51:00,040 --> 00:51:04,120
This covariance matrix-- so
let's look at capital sigma
844
00:51:04,120 --> 00:51:05,210
itself right now.
845
00:51:05,210 --> 00:51:07,070
This covariance matrix,
we know that if we
846
00:51:07,070 --> 00:51:11,690
read its entries, what
we get is the covariance
847
00:51:11,690 --> 00:51:15,260
between the coordinates
of the X's, right,
848
00:51:15,260 --> 00:51:19,140
of the random vector, X.
And the coordinates, well,
849
00:51:19,140 --> 00:51:22,530
by definition, are attached
to a coordinate system.
850
00:51:22,530 --> 00:51:25,830
So I only know
what the covariance
851
00:51:25,830 --> 00:51:30,570
of X in of those two things are,
or the covariance of those two
852
00:51:30,570 --> 00:51:31,320
things are.
853
00:51:31,320 --> 00:51:33,570
But what if I want to find
coordinates between linear
854
00:51:33,570 --> 00:51:35,076
combinations of the X's?
855
00:51:35,076 --> 00:51:37,200
Sorry, if I want to find
covariances between linear
856
00:51:37,200 --> 00:51:38,566
combinations of those X's.
857
00:51:38,566 --> 00:51:40,440
And that's exactly what
this allows me to do.
858
00:51:40,440 --> 00:51:44,640
It says, well, if I pre-
and post-multiply by u,
859
00:51:44,640 --> 00:51:47,010
this is actually telling
me what the variance
860
00:51:47,010 --> 00:51:51,950
of X along direction u is, OK?
861
00:51:51,950 --> 00:51:53,944
So there's a lot of
information in there,
862
00:51:53,944 --> 00:51:55,610
and it's just really
exploiting the fact
863
00:51:55,610 --> 00:52:00,600
that there is some linearity
going on in the covariance.
864
00:52:00,600 --> 00:52:02,060
So, why variance?
865
00:52:02,060 --> 00:52:03,870
Why is variance
interesting for us, right?
866
00:52:03,870 --> 00:52:04,370
Why?
867
00:52:04,370 --> 00:52:05,760
I started by saying,
here, we're going
868
00:52:05,760 --> 00:52:07,050
to be interested
in having something
869
00:52:07,050 --> 00:52:08,151
to do dimension reduction.
870
00:52:08,151 --> 00:52:10,650
We have-- think of your points
as [? being in a ?] dimension
871
00:52:10,650 --> 00:52:13,990
larger than 4, and we're going
to try to reduce the dimension.
872
00:52:13,990 --> 00:52:15,480
So let's just think
for one second,
873
00:52:15,480 --> 00:52:19,320
what do we want about a
dimension reduction procedure?
874
00:52:19,320 --> 00:52:23,427
If I have all my points that
live in, say, three dimensions,
875
00:52:23,427 --> 00:52:25,260
and I have one point
here and one point here
876
00:52:25,260 --> 00:52:28,020
and one point here and one
point here and one point here,
877
00:52:28,020 --> 00:52:30,090
and I decide to project
them onto some plane--
878
00:52:30,090 --> 00:52:32,132
that I take a plane that's
just like this, what's
879
00:52:32,132 --> 00:52:34,673
going to happen is that those
points are all going to project
880
00:52:34,673 --> 00:52:36,030
to the same point, right?
881
00:52:36,030 --> 00:52:38,070
I'm just going to
not see anything.
882
00:52:38,070 --> 00:52:40,410
However, if I take a
plane which is like this,
883
00:52:40,410 --> 00:52:42,932
they're all going to
project into some nice line.
884
00:52:42,932 --> 00:52:44,640
Maybe I can even
project them onto a line
885
00:52:44,640 --> 00:52:47,160
and they will still be
far apart from each other.
886
00:52:47,160 --> 00:52:48,160
So that's what you want.
887
00:52:48,160 --> 00:52:51,930
You want to be able to
say, when I take my points
888
00:52:51,930 --> 00:52:54,610
and I say I project them
onto lower dimensions,
889
00:52:54,610 --> 00:52:57,270
I do not want them to collapse
into one single point.
890
00:52:57,270 --> 00:53:00,540
I want them to be as spread out as
possible in the direction
891
00:53:00,540 --> 00:53:02,251
on which I project.
892
00:53:02,251 --> 00:53:04,000
And this is what we're
going to try to do.
893
00:53:04,000 --> 00:53:06,510
And of course, measuring
spread between points
894
00:53:06,510 --> 00:53:08,160
can be done in many ways, right?
895
00:53:08,160 --> 00:53:09,960
I mean, you could
look at, I don't know,
896
00:53:09,960 --> 00:53:12,900
sum of pairwise distances
between those guys.
897
00:53:12,900 --> 00:53:14,790
You could look at
some sort of energy.
898
00:53:14,790 --> 00:53:16,380
You can look at
many ways to measure
899
00:53:16,380 --> 00:53:18,199
spread in a direction.
900
00:53:18,199 --> 00:53:19,740
But variance is a
good way to measure
901
00:53:19,740 --> 00:53:21,150
spread between points.
902
00:53:21,150 --> 00:53:23,727
If you have a lot of
variance between your points,
903
00:53:23,727 --> 00:53:25,560
then chances are they're
going to be spread.
904
00:53:25,560 --> 00:53:27,720
Now, this is not
always the case, right?
905
00:53:27,720 --> 00:53:30,480
If I have a direction in which
all my points are clumped
906
00:53:30,480 --> 00:53:33,234
onto one big point and
one other big point,
907
00:53:33,234 --> 00:53:34,900
it's going to choose
this because that's
908
00:53:34,900 --> 00:53:37,180
the direction that
has a lot of variance.
909
00:53:37,180 --> 00:53:39,030
But hopefully, the
variance is going
910
00:53:39,030 --> 00:53:41,560
to spread things out nicely.
911
00:53:41,560 --> 00:53:47,730
So the idea of principal
component analysis
912
00:53:47,730 --> 00:53:51,330
is going to try to
identify those variances--
913
00:53:51,330 --> 00:53:55,740
those directions along which
we have a lot of variance.
914
00:53:55,740 --> 00:53:57,870
Reciprocally, we're
going to try to eliminate
915
00:53:57,870 --> 00:54:01,890
the directions along which we do
not have a lot of variance, OK?
916
00:54:01,890 --> 00:54:02,640
And let's see why.
917
00:54:02,640 --> 00:54:08,130
Well, if-- so here's
the first claim.
918
00:54:08,130 --> 00:54:14,000
If u transpose Su is equal
to 0, what's happening?
919
00:54:14,000 --> 00:54:17,159
Well, I know that an empirical
variance is equal to 0.
920
00:54:17,159 --> 00:54:18,950
What does it mean for
an empirical variance
921
00:54:18,950 --> 00:54:22,056
to be equal to 0?
922
00:54:22,056 --> 00:54:23,680
So I give you a bunch
of points, right?
923
00:54:23,680 --> 00:54:26,420
So those points are those
points-- u transpose
924
00:54:26,420 --> 00:54:29,090
X1, u transpose-- those
are a bunch of numbers.
925
00:54:29,090 --> 00:54:31,090
What does it mean to have
the empirical variance
926
00:54:31,090 --> 00:54:33,279
of those points
being equal to 0?
927
00:54:33,279 --> 00:54:34,570
AUDIENCE: They're all the same.
928
00:54:34,570 --> 00:54:36,590
PHILIPPE RIGOLLET:
They're all the same.
929
00:54:36,590 --> 00:54:43,680
So what it means is that
when I have my points, right?
930
00:54:43,680 --> 00:54:46,470
So, can you find a direction
for those points in which they
931
00:54:46,470 --> 00:54:48,850
project to all the same point?
932
00:54:51,400 --> 00:54:52,360
No, right?
933
00:54:52,360 --> 00:54:53,590
There's no such thing.
934
00:54:53,590 --> 00:54:55,870
For this to happen, you have
to have your points which
935
00:54:55,870 --> 00:54:57,849
are perfectly aligned.
936
00:54:57,849 --> 00:54:59,390
And then when you're
going to project
937
00:54:59,390 --> 00:55:01,830
onto the orthogonal
of this guy, they're
938
00:55:01,830 --> 00:55:03,690
going to all project
to the same point
939
00:55:03,690 --> 00:55:06,450
here, which means that
the empirical variance is
940
00:55:06,450 --> 00:55:08,790
going to be 0.
941
00:55:08,790 --> 00:55:10,270
Now, this is an extreme case.
942
00:55:10,270 --> 00:55:11,760
This will never
happen in practice,
943
00:55:11,760 --> 00:55:13,840
because if that
happens, well, I mean,
944
00:55:13,840 --> 00:55:16,850
you can basically figure
that out very quickly.
945
00:55:16,850 --> 00:55:21,520
So in the same way,
it's very unlikely
946
00:55:21,520 --> 00:55:23,710
that you're going to have
u transpose sigma u, which
947
00:55:23,710 --> 00:55:26,230
is equal to 0, which means
that, essentially, all
948
00:55:26,230 --> 00:55:28,510
your points are [INAUDIBLE]
or let's say all of them
949
00:55:28,510 --> 00:55:30,069
are orthogonal to u, right?
950
00:55:30,069 --> 00:55:31,360
So it's exactly the same thing.
951
00:55:31,360 --> 00:55:33,330
It just says that in
the population case,
952
00:55:33,330 --> 00:55:36,960
there's no probability that your
points deviate from this guy
953
00:55:36,960 --> 00:55:37,510
here.
954
00:55:37,510 --> 00:55:41,142
This happens with
zero probability, OK?
955
00:55:41,142 --> 00:55:42,600
And that's just
because if you look
956
00:55:42,600 --> 00:55:46,690
at the variance of this
guy, it's going to be 0.
957
00:55:46,690 --> 00:55:48,910
And then that means that
there's no deviation.
958
00:55:48,910 --> 00:55:51,430
By the way, I'm using
the name projection
959
00:55:51,430 --> 00:55:55,510
when I talk about u
transpose X, right?
960
00:55:55,510 --> 00:55:59,170
So let's just be
clear about this.
961
00:55:59,170 --> 00:56:04,090
If you-- so let's say I
have a bunch of points,
962
00:56:04,090 --> 00:56:06,050
and u is a vector
in this direction.
963
00:56:06,050 --> 00:56:08,650
And let's say that u has the--
964
00:56:08,650 --> 00:56:10,120
so this is 0.
965
00:56:10,120 --> 00:56:10,720
This is u.
966
00:56:10,720 --> 00:56:17,560
And let's say that
u has norm, 1, OK?
967
00:56:17,560 --> 00:56:21,140
When I look, what is the
coordinate of the projection?
968
00:56:21,140 --> 00:56:23,860
So what is the length
of this guy here?
969
00:56:23,860 --> 00:56:25,569
Let's call this guy X1.
970
00:56:25,569 --> 00:56:26,860
What is the length of this guy?
971
00:56:31,150 --> 00:56:32,330
In terms of inner products?
972
00:56:35,990 --> 00:56:39,678
This is exactly u transpose X1.
973
00:56:39,678 --> 00:56:42,730
This length here,
if this is X2, this
974
00:56:42,730 --> 00:56:46,580
is exactly u transpose X2, OK?
975
00:56:46,580 --> 00:56:52,430
So those-- u transpose X
measure exactly the distance
976
00:56:52,430 --> 00:56:55,700
to the origin of those--
977
00:56:55,700 --> 00:56:58,310
I mean, it's really--
978
00:56:58,310 --> 00:57:00,887
think of it as being
just an x-axis thing.
979
00:57:00,887 --> 00:57:02,220
You just have a bunch of points.
980
00:57:02,220 --> 00:57:02,960
You have an origin.
981
00:57:02,960 --> 00:57:04,520
And it's really just
telling you what
982
00:57:04,520 --> 00:57:07,670
the coordinate on this
axis is going to be, right?
983
00:57:07,670 --> 00:57:10,820
So in particular, if the
empirical variance is 0,
984
00:57:10,820 --> 00:57:12,470
it means that all
these points project
985
00:57:12,470 --> 00:57:14,840
to the same point, which
means that they have
986
00:57:14,840 --> 00:57:16,912
to be orthogonal to this guy.
987
00:57:16,912 --> 00:57:19,370
And you can think of it as
being also maybe an entire plane
988
00:57:19,370 --> 00:57:23,990
that's orthogonal
to this line, OK?
989
00:57:23,990 --> 00:57:26,590
So that's why I talk
about projection,
990
00:57:26,590 --> 00:57:29,560
because the inner
products, u transpose X,
991
00:57:29,560 --> 00:57:36,220
is really measuring
the coordinates of X
992
00:57:36,220 --> 00:57:39,410
when u becomes the x-axis.
993
00:57:39,410 --> 00:57:42,820
Now, if u does not have
norm 1, then you just
994
00:57:42,820 --> 00:57:44,365
have a change of scale here.
995
00:57:44,365 --> 00:57:46,790
You just have a
change of unit, right?
996
00:57:46,790 --> 00:57:51,560
So this is really u transpose X1.
997
00:57:51,560 --> 00:57:54,044
The coordinates should really
be divided by the norm of u.
998
00:57:59,150 --> 00:58:04,970
OK, so now, just in
the same way-- so
999
00:58:04,970 --> 00:58:07,160
we're never going
to have exactly 0.
1000
00:58:07,160 --> 00:58:08,810
But if we [INAUDIBLE]
the other end,
1001
00:58:08,810 --> 00:58:12,050
if u transpose Su is
large, what does it mean?
1002
00:58:14,990 --> 00:58:17,740
It means that when
I look at my points
1003
00:58:17,740 --> 00:58:22,194
as projected onto the
axis generated by u,
1004
00:58:22,194 --> 00:58:23,860
they're going to have
a lot of variance.
1005
00:58:23,860 --> 00:58:25,930
They're going to be far away
from each other in average,
1006
00:58:25,930 --> 00:58:26,430
right?
1007
00:58:26,430 --> 00:58:28,900
That's what large variance
means, or at least
1008
00:58:28,900 --> 00:58:31,310
large empirical variance means.
1009
00:58:31,310 --> 00:58:34,690
And same thing for u.
1010
00:58:34,690 --> 00:58:36,130
So what we're going
to try to find
1011
00:58:36,130 --> 00:58:39,870
is a u that maximizes this.
1012
00:58:39,870 --> 00:58:42,230
If I can find a u
that maximizes this
1013
00:58:42,230 --> 00:58:44,790
so I can look in
every direction,
1014
00:58:44,790 --> 00:58:48,320
and suddenly I find a direction
in which the spread is massive,
1015
00:58:48,320 --> 00:58:50,070
then that's a point
on which I'm basically
1016
00:58:50,070 --> 00:58:52,260
the least likely
to have my points
1017
00:58:52,260 --> 00:58:54,824
project onto each other
and collide, right?
1018
00:58:54,824 --> 00:58:56,490
At least I know they're
going to project
1019
00:58:56,490 --> 00:58:59,710
at least onto two points.
1020
00:58:59,710 --> 00:59:02,290
So the idea now is
to say, OK, let's try
1021
00:59:02,290 --> 00:59:04,630
to maximize this spread, right?
1022
00:59:04,630 --> 00:59:09,130
So we're going to try to
find the maximum over all u's
1023
00:59:09,130 --> 00:59:12,886
of u transpose Su.
1024
00:59:12,886 --> 00:59:15,010
And that's going to be the
direction that maximizes
1025
00:59:15,010 --> 00:59:15,968
the empirical variance.
1026
00:59:15,968 --> 00:59:22,075
Now of course, if I read it
like that for all u's in Rd,
1027
00:59:22,075 --> 00:59:23,666
what is the value
of this maximum?
1028
00:59:28,060 --> 00:59:29,220
It's infinity, right?
1029
00:59:29,220 --> 00:59:32,160
Because I can always
multiply u by 10,
1030
00:59:32,160 --> 00:59:34,662
and this entire thing is
going to be multiplied by 100.
1031
00:59:34,662 --> 00:59:36,620
So I'm just going to take
u as large as I want,
1032
00:59:36,620 --> 00:59:38,661
and this thing is going
to be as large as I want,
1033
00:59:38,661 --> 00:59:40,050
and so I need to constrain u.
1034
00:59:40,050 --> 00:59:42,840
And as I said, I need
to have u of size 1
1035
00:59:42,840 --> 00:59:45,990
to talk about coordinates
in the system generated
1036
00:59:45,990 --> 00:59:47,340
by u like this.
1037
00:59:47,340 --> 00:59:50,730
So I'm just going to
constrain u to have
1038
00:59:50,730 --> 00:59:55,467
Euclidean norm equal to 1, OK?
1039
00:59:55,467 --> 00:59:57,050
So that's going to
be my goal-- trying
1040
00:59:57,050 --> 01:00:01,100
to find the largest
possible u transpose Su,
1041
01:00:01,100 --> 01:00:03,680
or in other words, empirical
variance of the points
1042
01:00:03,680 --> 01:00:07,520
projected onto the direction
u when u is of norm 1,
1043
01:00:07,520 --> 01:00:11,039
which justifies using
the word, "direction,"
1044
01:00:11,039 --> 01:00:12,830
and because there's no
magnitude to this u.
1045
01:00:17,770 --> 01:00:22,410
OK, so how am I
going to do this?
1046
01:00:22,410 --> 01:00:25,230
I could just fold and
say, let's just optimize
1047
01:00:25,230 --> 01:00:26,700
this thing, right?
1048
01:00:26,700 --> 01:00:28,540
Let's just take this problem.
1049
01:00:28,540 --> 01:00:32,250
It says maximize a function
under some constraints.
1050
01:00:32,250 --> 01:00:34,125
Immediately, the constraint
is sort of nasty.
1051
01:00:34,125 --> 01:00:37,212
I'm on a sphere, and I'm trying
to move points on the sphere.
1052
01:00:37,212 --> 01:00:38,670
And I'm maximizing
this thing which
1053
01:00:38,670 --> 01:00:40,182
actually happens to be convex.
1054
01:00:40,182 --> 01:00:42,390
And we know how to
minimize convex functions,
1055
01:00:42,390 --> 01:00:45,280
but maximizing them is
a different question.
1056
01:00:45,280 --> 01:00:47,340
And so this problem
might be super hard.
1057
01:00:47,340 --> 01:00:49,020
So I can just say,
OK, here's what
1058
01:00:49,020 --> 01:00:52,950
I want to do, and let me
give that to an optimizer
1059
01:00:52,950 --> 01:00:56,010
and just hope that the optimizer
can solve this problem for me.
1060
01:00:56,010 --> 01:00:57,630
That's one thing we can do.
1061
01:00:57,630 --> 01:01:00,092
Now as you can imagine, PCA
is so widespread, right?
1062
01:01:00,092 --> 01:01:01,800
Principal component
analysis is something
1063
01:01:01,800 --> 01:01:03,700
that people do constantly.
1064
01:01:03,700 --> 01:01:06,190
And so that means that we
know how to do this fast.
1065
01:01:06,190 --> 01:01:07,600
So that's one thing.
1066
01:01:07,600 --> 01:01:10,740
The other thing that you should
probably question about why--
1067
01:01:10,740 --> 01:01:13,110
if this thing is actually
difficult, why in the world
1068
01:01:13,110 --> 01:01:16,200
would you even choose the
variance as a measure of spread
1069
01:01:16,200 --> 01:01:19,020
if there's so many
measures of spread, right?
1070
01:01:19,020 --> 01:01:21,222
The variance is one
measure of spread.
1071
01:01:21,222 --> 01:01:22,680
It's not guaranteed
that everything
1072
01:01:22,680 --> 01:01:26,366
is going to project nicely
far apart from each other.
1073
01:01:26,366 --> 01:01:27,990
So we could choose
the variance, but we
1074
01:01:27,990 --> 01:01:28,800
could choose something else.
1075
01:01:28,800 --> 01:01:30,990
If the variance does
not help, why choose it?
1076
01:01:30,990 --> 01:01:32,520
Turns out the variance helps.
1077
01:01:32,520 --> 01:01:35,555
So this is indeed a
non-convex problem.
1078
01:01:35,555 --> 01:01:38,340
I'm maximizing, so
it's actually the same.
1079
01:01:38,340 --> 01:01:41,850
I can make this
constraint convex
1080
01:01:41,850 --> 01:01:43,920
because I'm maximizing
a convex function,
1081
01:01:43,920 --> 01:01:45,720
so it's clear that
the maximum is going
1082
01:01:45,720 --> 01:01:47,220
to be attained at the boundary.
1083
01:01:47,220 --> 01:01:51,540
So I can actually just fill
in this sphere into a convex ball.
1084
01:01:51,540 --> 01:01:53,430
However, I'm still
maximizing, so this
1085
01:01:53,430 --> 01:01:55,170
is a non-convex problem.
1086
01:01:55,170 --> 01:01:57,550
And this turns out to be the
fanciest non-convex problem
1087
01:01:57,550 --> 01:01:59,001
we know how to solve.
1088
01:01:59,001 --> 01:02:00,750
And the reason why we
know how to solve it
1089
01:02:00,750 --> 01:02:04,410
is not because of optimization
or using gradient-type things
1090
01:02:04,410 --> 01:02:06,690
or any of the
algorithms that I mentioned
1091
01:02:06,690 --> 01:02:09,350
during the maximum likelihood.
1092
01:02:09,350 --> 01:02:11,000
It's because of linear algebra.
1093
01:02:11,000 --> 01:02:13,980
Linear algebra guarantees that
we know how to solve this.
1094
01:02:13,980 --> 01:02:17,885
And to understand this, we
need to go a little deeper
1095
01:02:17,885 --> 01:02:22,360
in linear algebra, and we
need to understand the concept
1096
01:02:22,360 --> 01:02:24,590
of diagonalization of a matrix.
1097
01:02:24,590 --> 01:02:29,850
So who has ever seen the
concept of an eigenvalue?
1098
01:02:29,850 --> 01:02:30,790
Oh, that's beautiful.
1099
01:02:30,790 --> 01:02:31,880
And if you're not
raising your hand,
1100
01:02:31,880 --> 01:02:33,588
you're just playing
"Candy Crush," right?
1101
01:02:33,588 --> 01:02:35,930
All right, so, OK.
1102
01:02:44,930 --> 01:02:46,640
This is great.
1103
01:02:46,640 --> 01:02:48,160
Everybody's seen it.
1104
01:02:48,160 --> 01:02:51,230
For my live audience of
millions, maybe you have not,
1105
01:02:51,230 --> 01:02:53,600
so I will still go through it.
1106
01:02:53,600 --> 01:02:58,840
All right, so one
of the basic facts--
1107
01:02:58,840 --> 01:03:02,490
and I remember when
I learned this in--
1108
01:03:02,490 --> 01:03:04,090
I mean, when I was
an undergrad, I
1109
01:03:04,090 --> 01:03:05,860
learned about the
spectral decomposition
1110
01:03:05,860 --> 01:03:07,450
and this diagonalization
of matrices.
1111
01:03:07,450 --> 01:03:09,070
And for me, it was just
a structural property
1112
01:03:09,070 --> 01:03:11,445
of matrices, but it turns out
that it's extremely useful,
1113
01:03:11,445 --> 01:03:13,294
and it's useful for
algorithmic purposes.
1114
01:03:13,294 --> 01:03:14,710
And so what this
theorem tells you
1115
01:03:14,710 --> 01:03:16,765
is that if you take
a symmetric matrix--
1116
01:03:22,860 --> 01:03:24,340
well, with real
entries, but that
1117
01:03:24,340 --> 01:03:28,220
really does not matter so much.
1118
01:03:28,220 --> 01:03:30,730
And here, I'm
going to actually--
1119
01:03:30,730 --> 01:03:33,200
so I take a symmetric matrix,
and actually S and sigma
1120
01:03:33,200 --> 01:03:36,190
are two such symmetric
matrices, right?
1121
01:03:36,190 --> 01:03:44,500
Then there exists P
and D, which are both--
1122
01:03:44,500 --> 01:03:47,000
so let's say d by d.
1123
01:03:47,000 --> 01:03:55,960
Which are both d by d
such that P is orthogonal.
1124
01:03:58,960 --> 01:04:02,420
That means that P transpose
P is equal to PP transpose
1125
01:04:02,420 --> 01:04:06,360
is equal to the identity.
1126
01:04:06,360 --> 01:04:07,630
And D is diagonal.
1127
01:04:11,840 --> 01:04:20,130
And sigma, let's say, is
equal to PDP transpose, OK?
1128
01:04:20,130 --> 01:04:22,080
So it's a diagonalization
because it's
1129
01:04:22,080 --> 01:04:23,970
finding a nice transformation.
1130
01:04:23,970 --> 01:04:25,260
P has some nice properties.
1131
01:04:25,260 --> 01:04:28,050
It's really just the change
of coordinates in which
1132
01:04:28,050 --> 01:04:31,044
your matrix is diagonal, right?
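The spectral theorem just stated can be seen numerically. A minimal sketch (an illustration, not part of the lecture): numpy's `eigh` computes exactly this decomposition for a symmetric matrix, returning the diagonal of D and the orthogonal matrix P.

```python
import numpy as np

# Sketch of the spectral theorem: for symmetric sigma, eigh returns
# eigenvalues (diagonal of D) and an orthogonal P with sigma = P D P'.
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
sigma = A @ A.T                         # symmetric (in fact PSD)

lam, P = np.linalg.eigh(sigma)          # lam ascending, columns of P orthonormal
D = np.diag(lam)

assert np.allclose(P.T @ P, np.eye(3))  # P'P = I, so P is orthogonal
assert np.allclose(P @ D @ P.T, sigma)  # sigma = P D P'
```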
1133
01:04:31,044 --> 01:04:32,460
And the way you
want to see this--
1134
01:04:32,460 --> 01:04:35,610
and I think it sort of helps
to think about this problem
1135
01:04:35,610 --> 01:04:36,720
as being--
1136
01:04:36,720 --> 01:04:38,276
sigma being a covariance matrix.
1137
01:04:38,276 --> 01:04:39,900
What does a covariance
matrix tell you?
1138
01:04:39,900 --> 01:04:41,490
Think of a
multivariate Gaussian.
1139
01:04:41,490 --> 01:04:43,660
Can everybody visualize a
three-dimensional Gaussian
1140
01:04:43,660 --> 01:04:45,150
density?
1141
01:04:45,150 --> 01:04:48,200
Right, so it's going to be some
sort of a bell-shaped curve,
1142
01:04:48,200 --> 01:04:51,870
but it might be more elongated
in one direction than another.
1143
01:04:51,870 --> 01:04:54,310
And then going to chop
it like that, all right?
1144
01:04:54,310 --> 01:04:56,120
So I'm going to chop it off.
1145
01:04:56,120 --> 01:05:00,070
And I'm going to look at
how it bleeds, all right?
1146
01:05:00,070 --> 01:05:02,287
So I'm just going to look
at where the blood is.
1147
01:05:02,287 --> 01:05:03,620
And what it's going to look at--
1148
01:05:03,620 --> 01:05:08,720
it's going to look like some
sort of ellipsoid, right?
1149
01:05:08,720 --> 01:05:11,652
In high dimension, it's
just going to be an olive.
1150
01:05:11,652 --> 01:05:13,610
And that is just going
to be bigger and bigger.
1151
01:05:13,610 --> 01:05:16,460
And then I chop it
off a little lower,
1152
01:05:16,460 --> 01:05:20,150
and I get something a
little bigger like this.
1153
01:05:20,150 --> 01:05:23,070
And so it turns out that sigma
is capturing exactly this,
1154
01:05:23,070 --> 01:05:23,570
right?
1155
01:05:23,570 --> 01:05:27,320
The matrix sigma-- so the
center of your covariance matrix
1156
01:05:27,320 --> 01:05:29,240
of your Gaussian is
going to be this thing.
1157
01:05:29,240 --> 01:05:33,690
And sigma is going to tell you
which direction it's elongated.
1158
01:05:33,690 --> 01:05:36,140
And so in particular, if you
look, if you knew an ellipse,
1159
01:05:36,140 --> 01:05:38,160
you know there's something
called principal axis, right?
1160
01:05:38,160 --> 01:05:39,743
So you could actually
define something
1161
01:05:39,743 --> 01:05:43,190
that looks like this, which is
this axis, the one along which
1162
01:05:43,190 --> 01:05:44,390
it's the most elongated.
1163
01:05:44,390 --> 01:05:47,345
Then the axis that
is orthogonal to it,
1164
01:05:47,345 --> 01:05:49,370
along which it's
slightly less elongated,
1165
01:05:49,370 --> 01:05:52,880
and you go again and again
along the orthogonal ones.
1166
01:05:52,880 --> 01:05:56,500
It turns out that
those things here
1167
01:05:56,500 --> 01:05:59,620
is the new coordinate system
in which this transformation, P
1168
01:05:59,620 --> 01:06:03,190
and P transpose, is
putting you into.
1169
01:06:03,190 --> 01:06:06,390
And D has entries
on the diagonal
1170
01:06:06,390 --> 01:06:09,979
which are exactly this length
and this length, right?
1171
01:06:09,979 --> 01:06:11,270
So that's just what it's doing.
1172
01:06:11,270 --> 01:06:12,920
It's just telling
you, well, if you
1173
01:06:12,920 --> 01:06:16,760
think of having this Gaussian
or this high-dimensional
1174
01:06:16,760 --> 01:06:19,990
ellipsoid, it's elongated
along certain directions.
1175
01:06:19,990 --> 01:06:23,020
And these directions are
actually maybe not well aligned
1176
01:06:23,020 --> 01:06:25,270
with your original coordinate
system, which might just
1177
01:06:25,270 --> 01:06:27,430
be the usual one, right--
1178
01:06:27,430 --> 01:06:29,740
north, south, and east, west.
1179
01:06:29,740 --> 01:06:30,800
Maybe I need to turn it.
1180
01:06:30,800 --> 01:06:33,174
And that's exactly what this
orthogonal transformation is
1181
01:06:33,174 --> 01:06:36,820
doing for you, all right?
1182
01:06:36,820 --> 01:06:39,627
So, in a way, this is actually
telling you even more.
1183
01:06:39,627 --> 01:06:41,710
It's telling you that any
matrix that's symmetric,
1184
01:06:41,710 --> 01:06:45,190
you can actually
turn it somewhere.
1185
01:06:45,190 --> 01:06:47,530
And that'll start to dilate
things in the directions
1186
01:06:47,530 --> 01:06:49,060
that you have, and
then turn it back
1187
01:06:49,060 --> 01:06:50,800
to what you originally had.
1188
01:06:50,800 --> 01:06:53,110
And that's actually
exactly the effect
1189
01:06:53,110 --> 01:06:57,180
of applying a symmetric matrix
to a vector, right?
1190
01:06:57,180 --> 01:06:58,920
And it's pretty impressive.
1191
01:06:58,920 --> 01:07:04,650
It says if I take sigma
times v. Any sigma that's
1192
01:07:04,650 --> 01:07:07,560
of this form, what I'm
doing is-- that's symmetric.
1193
01:07:07,560 --> 01:07:09,360
What I'm really
doing to v is I'm
1194
01:07:09,360 --> 01:07:12,150
changing its coordinate
system, so I'm rotating it.
1195
01:07:12,150 --> 01:07:14,970
Then I'm changing-- I'm
multiplying its coordinates,
1196
01:07:14,970 --> 01:07:16,956
and then I'm rotating it back.
1197
01:07:16,956 --> 01:07:18,330
That's all it's
doing, and that's
1198
01:07:18,330 --> 01:07:21,550
what all symmetric
matrices do, which
1199
01:07:21,550 --> 01:07:24,070
means that this is doing a lot.
1200
01:07:24,070 --> 01:07:27,130
All right, so OK.
1201
01:07:27,130 --> 01:07:29,237
So, what do I know?
1202
01:07:29,237 --> 01:07:30,820
So I'm not going to
prove that this is
1203
01:07:30,820 --> 01:07:32,140
the so-called spectral theorem.
1204
01:07:39,270 --> 01:07:45,850
And the diagonal entries of
D is of the form, lambda 1,
1205
01:07:45,850 --> 01:07:49,980
lambda 2, lambda d, 0, 0.
1206
01:07:49,980 --> 01:08:01,800
And the lambda j's are
called eigenvalues of sigma.
1207
01:08:01,800 --> 01:08:05,170
Now in general, those numbers
can be positive, negative,
1208
01:08:05,170 --> 01:08:06,660
or equal to 0.
1209
01:08:06,660 --> 01:08:12,000
But here, I know that
sigma and S are--
1210
01:08:12,000 --> 01:08:15,290
well, they're
symmetric for sure,
1211
01:08:15,290 --> 01:08:17,467
but they are positive
semidefinite.
1212
01:08:23,939 --> 01:08:25,840
What does it mean?
1213
01:08:25,840 --> 01:08:30,930
It means that when I take u
transpose sigma u for example,
1214
01:08:30,930 --> 01:08:33,192
this number is
always non-negative.
1215
01:08:35,910 --> 01:08:36,720
Why is this true?
1216
01:08:42,770 --> 01:08:43,609
What is this number?
1217
01:08:47,670 --> 01:08:49,850
It's the variance of--
and actually, I don't even
1218
01:08:49,850 --> 01:08:51,229
need to finish this sentence.
1219
01:08:51,229 --> 01:08:53,957
As soon as I say that
this is a variance, well,
1220
01:08:53,957 --> 01:08:55,040
it has to be non-negative.
1221
01:08:55,040 --> 01:08:57,990
We know that a variance
is not negative.
1222
01:08:57,990 --> 01:09:00,532
And so, that's also a
nice way you can use that.
1223
01:09:00,532 --> 01:09:02,240
So it's just to say,
well, OK, this thing
1224
01:09:02,240 --> 01:09:04,680
is positive semidefinite because
it's a covariance matrix.
1225
01:09:04,680 --> 01:09:06,920
So I know it's a variance, OK?
1226
01:09:06,920 --> 01:09:08,779
So I get this.
1227
01:09:08,779 --> 01:09:10,560
Now, if I had some
negative numbers--
1228
01:09:10,560 --> 01:09:15,350
so the effect of that is that
when I draw this picture,
1229
01:09:15,350 --> 01:09:19,040
those axes are always positive,
which is kind of a weird thing
1230
01:09:19,040 --> 01:09:19,950
to say.
1231
01:09:19,950 --> 01:09:23,840
But what it means is that when
I take a vector, v, I rotate it,
1232
01:09:23,840 --> 01:09:28,250
and then I stretch it in the
directions of the coordinate,
1233
01:09:28,250 --> 01:09:30,260
I cannot flip it.
1234
01:09:30,260 --> 01:09:34,260
I can only stretch or shrink,
but I cannot flip its sign,
1235
01:09:34,260 --> 01:09:34,760
all right?
1236
01:09:34,760 --> 01:09:37,370
But in general, for
any symmetric matrices,
1237
01:09:37,370 --> 01:09:38,840
I could do this.
1238
01:09:38,840 --> 01:09:40,910
But when it's positive
semidefinite,
1239
01:09:40,910 --> 01:09:43,020
actually what turns out
is that all the lambda
1240
01:09:43,020 --> 01:09:48,350
j's are non-negative.
1241
01:09:48,350 --> 01:09:51,370
I cannot flip it, OK?
1242
01:09:51,370 --> 01:09:53,778
So all the eigenvalues
are non-negative.
1243
01:09:56,590 --> 01:09:58,469
That's a property
of positive semidef.
1244
01:09:58,469 --> 01:10:00,510
So when it's symmetric,
you have the eigenvalues.
1245
01:10:00,510 --> 01:10:01,670
They can be any number.
1246
01:10:01,670 --> 01:10:03,780
And when it's positive
semidefinite, in particular
1247
01:10:03,780 --> 01:10:05,220
that's the case of
the covariance matrix
1248
01:10:05,220 --> 01:10:07,110
and the empirical
covariance matrix, right?
1249
01:10:07,110 --> 01:10:08,940
Because the empirical
covariance matrix
1250
01:10:08,940 --> 01:10:12,150
is an empirical variance,
which itself is non-negative.
1251
01:10:12,150 --> 01:10:17,900
And so I get that the
eigenvalues are non-negative.
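This non-negativity claim is easy to check on data. A short sketch (illustrative, with made-up data): an empirical covariance matrix is positive semidefinite, so all the eigenvalues `eigvalsh` returns are non-negative, up to floating-point round-off.

```python
import numpy as np

# Sketch: every empirical covariance matrix is PSD, so its
# eigenvalues are non-negative (each one is a variance along
# the corresponding eigenvector).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
S = np.cov(X, rowvar=False)

lam = np.linalg.eigvalsh(S)             # eigenvalues, ascending
assert (lam >= -1e-10).all()            # non-negative up to round-off
```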
1252
01:10:17,900 --> 01:10:23,030
All right, so principal
component analysis is saying,
1253
01:10:23,030 --> 01:10:32,370
OK, I want to find
the direction, u,
1254
01:10:32,370 --> 01:10:38,830
that maximizes u
transpose Su, all right?
1255
01:10:38,830 --> 01:10:40,420
I've just introduced
in one slide
1256
01:10:40,420 --> 01:10:41,690
something about eigenvalues.
1257
01:10:41,690 --> 01:10:44,740
So hopefully, they should help.
1258
01:10:44,740 --> 01:10:47,560
So what is it that I'm
going to be getting?
1259
01:10:47,560 --> 01:10:51,446
Well, let's just
see what happens.
1260
01:10:51,446 --> 01:10:53,570
Oh, I forgot to mention
that-- and I will use this.
1261
01:10:53,570 --> 01:10:56,020
So the vj's are
called eigenvectors.
1262
01:10:56,020 --> 01:11:08,690
And then the matrix, P,
has columns v1 to vd, OK?
1263
01:11:08,690 --> 01:11:13,370
The fact that it's orthogonal--
that P transpose P is equal
1264
01:11:13,370 --> 01:11:15,470
to the identity--
1265
01:11:15,470 --> 01:11:20,810
means that those guys
satisfied that vi transpose
1266
01:11:20,810 --> 01:11:27,485
vj is equal to 0 if i
is different from j.
1267
01:11:27,485 --> 01:11:31,040
And vi transpose vi is
actually equal to 1,
1268
01:11:31,040 --> 01:11:33,920
right, because the
entries of PP transpose
1269
01:11:33,920 --> 01:11:38,990
are exactly going to be of
the form, vi transpose vj, OK?
1270
01:11:38,990 --> 01:11:40,890
So those v's are
called eigenvectors.
1271
01:11:46,000 --> 01:11:52,020
And v1 is attached to lambda 1,
and v2 is attached to lambda 2,
1272
01:11:52,020 --> 01:11:53,180
OK?
1273
01:11:53,180 --> 01:11:56,280
So let's see what's
happening with those things.
1274
01:11:56,280 --> 01:11:58,045
What happens if I take sigma--
1275
01:11:58,045 --> 01:12:00,170
so if you know eigenvalues,
you know exactly what's
1276
01:12:00,170 --> 01:12:01,580
going to happen.
1277
01:12:01,580 --> 01:12:06,920
If I look at, say, sigma
times v1, well, what is sigma?
1278
01:12:06,920 --> 01:12:15,440
We know that sigma
is PDP transpose v1.
1279
01:12:15,440 --> 01:12:17,420
What is P transpose times v1?
1280
01:12:17,420 --> 01:12:21,560
Well, P transpose has
rows v1 transpose,
1281
01:12:21,560 --> 01:12:26,850
v2 transpose, all the
way to vd transpose.
1282
01:12:26,850 --> 01:12:30,910
So when I multiply
this by v1, what
1283
01:12:30,910 --> 01:12:32,820
I'm left with is
the first coordinate
1284
01:12:32,820 --> 01:12:38,010
is going to be equal to 1
and the second coordinate is
1285
01:12:38,010 --> 01:12:40,980
going to be equal to 0, right?
1286
01:12:40,980 --> 01:12:42,910
Because they're
orthogonal to each other--
1287
01:12:42,910 --> 01:12:45,810
0 all the way to the end.
1288
01:12:45,810 --> 01:12:48,890
So that's when I
do P transpose v1.
1289
01:12:48,890 --> 01:12:55,250
Now I multiply by
D. Well, I'm just
1290
01:12:55,250 --> 01:12:58,950
multiplying this guy by lambda
1, this guy by lambda 2,
1291
01:12:58,950 --> 01:13:02,150
and this guy by lambda d, so
this is really just lambda 1.
1292
01:13:04,720 --> 01:13:12,080
And now I need to
post-multiply by P.
1293
01:13:12,080 --> 01:13:14,190
So what is P times this guy?
1294
01:13:14,190 --> 01:13:19,730
Well, P is v1 all the way to vd.
1295
01:13:19,730 --> 01:13:21,290
And now I multiply
by a vector that
1296
01:13:21,290 --> 01:13:24,620
only has 0's except
lambda 1 on the first guy.
1297
01:13:24,620 --> 01:13:26,510
So this is just
lambda 1 times v1.
1298
01:13:29,470 --> 01:13:34,630
So what we've proved is that
sigma times v1 is lambda 1 v1,
1299
01:13:34,630 --> 01:13:37,330
and that's probably the
notion of eigenvalue you're
1300
01:13:37,330 --> 01:13:39,010
most comfortable with, right?
1301
01:13:39,010 --> 01:13:41,620
So just when I
multiply by v1, I get
1302
01:13:41,620 --> 01:13:45,440
v1 back multiplied by something,
which is the eigenvalue.
1303
01:13:45,440 --> 01:13:54,450
So in particular, if I look
at v1 transpose sigma v1,
1304
01:13:54,450 --> 01:13:55,180
what do I get?
1305
01:13:55,180 --> 01:13:58,800
Well, I get lambda
1 v1 transpose v1,
1306
01:13:58,800 --> 01:14:00,180
which is 1, right?
1307
01:14:00,180 --> 01:14:04,050
So this is actually
lambda 1 v1 transpose v1,
1308
01:14:04,050 --> 01:14:08,360
which is lambda 1, OK?
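This computation can be checked numerically. Below is a minimal sketch in numpy, using a made-up symmetric positive semi-definite matrix standing in for sigma (an assumption for illustration, not data from the lecture); `np.linalg.eigh` gives the decomposition PDP transpose, and the top eigenpair satisfies both identities derived above.

```python
import numpy as np

# Made-up symmetric PSD matrix standing in for sigma (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Sigma = A.T @ A / 5

# Spectral decomposition Sigma = P D P^T; the columns of P are the v_j's.
# np.linalg.eigh returns the eigenvalues in ascending order.
lambdas, P = np.linalg.eigh(Sigma)
v1, lambda1 = P[:, -1], lambdas[-1]  # eigenpair with the largest eigenvalue

print(np.allclose(Sigma @ v1, lambda1 * v1))  # sigma v1 = lambda1 v1 -> True
print(np.isclose(v1 @ Sigma @ v1, lambda1))   # v1^T sigma v1 = lambda1 -> True
```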
1309
01:14:08,360 --> 01:14:10,940
And if I do the same
with v2, clearly I'm
1310
01:14:10,940 --> 01:14:13,450
going to get v2 transpose sigma
1311
01:14:13,450 --> 01:14:16,910
v2 is equal to lambda 2.
1312
01:14:16,910 --> 01:14:19,910
So for each of the
vj's, I know that if I
1313
01:14:19,910 --> 01:14:21,650
look at the variance
along the vj,
1314
01:14:21,650 --> 01:14:27,760
it's actually exactly given by
those eigenvalues, all right?
1315
01:14:27,760 --> 01:14:38,490
Which proves this, because the
variance along the eigenvectors
1316
01:14:38,490 --> 01:14:40,270
is actually equal
to the eigenvalues.
1317
01:14:40,270 --> 01:14:43,760
So since they're variances,
they have to be non-negative.
1318
01:14:43,760 --> 01:14:47,960
So now, I'm looking for
the one direction that
1319
01:14:47,960 --> 01:14:50,450
has the most variance, right?
1320
01:14:50,450 --> 01:14:53,040
But that's not only
among the eigenvectors.
1321
01:14:53,040 --> 01:14:55,520
That's also among
the other directions
1322
01:14:55,520 --> 01:14:57,200
that are in-between
the eigenvectors.
1323
01:14:57,200 --> 01:14:59,390
If I were to look only
at the eigenvectors,
1324
01:14:59,390 --> 01:15:02,420
it would just tell me, well,
just pick the eigenvector, vj,
1325
01:15:02,420 --> 01:15:05,990
that's associated to the
largest of the lambda j's.
1326
01:15:05,990 --> 01:15:09,080
But it turns out that that's
also true for any vector--
1327
01:15:09,080 --> 01:15:11,810
that the maximum direction is
actually one direction which
1328
01:15:11,810 --> 01:15:13,809
is among the eigenvectors.
1329
01:15:13,809 --> 01:15:16,100
And among the eigenvectors,
we know that the one that's
1330
01:15:16,100 --> 01:15:17,080
the largest--
1331
01:15:17,080 --> 01:15:18,740
that carries the
largest variance is
1332
01:15:18,740 --> 01:15:23,780
the one that's associated to the
largest eigenvalue, all right?
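A quick numerical sanity check of that claim, again with a made-up covariance matrix (an assumption, not from the lecture): the variance u transpose sigma u along any unit direction u, including directions in-between the eigenvectors, never exceeds the largest eigenvalue.

```python
import numpy as np

# Made-up covariance matrix, purely for illustration.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
Sigma = np.cov(X, rowvar=False)

lambdas, P = np.linalg.eigh(Sigma)  # ascending eigenvalues
lambda_max = lambdas[-1]

# Sample many random unit directions u and record the largest u^T Sigma u.
worst = 0.0
for _ in range(1000):
    u = rng.standard_normal(4)
    u /= np.linalg.norm(u)
    worst = max(worst, u @ Sigma @ u)

print(worst <= lambda_max + 1e-9)  # no direction beats the top eigenvalue
```

Equality is attained exactly when u is the top eigenvector `P[:, -1]`.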
1333
01:15:23,780 --> 01:15:26,990
And so this is what PCA is
going to try to do for me.
1334
01:15:26,990 --> 01:15:29,420
So in practice, that's what
I mentioned already, right?
1335
01:15:29,420 --> 01:15:31,970
We're trying to
project the point cloud
1336
01:15:31,970 --> 01:15:34,730
onto a lower-dimensional
space of dimension d prime,
1337
01:15:34,730 --> 01:15:36,800
by keeping as much
information as possible.
1338
01:15:36,800 --> 01:15:39,230
And by "as much information,"
I mean we do not
1339
01:15:39,230 --> 01:15:41,540
want points to collide.
1340
01:15:41,540 --> 01:15:45,530
And so what PCA is
going to do is just
1341
01:15:45,530 --> 01:15:48,231
going to try to project
onto directions.
1342
01:15:48,231 --> 01:15:49,730
So there's going
to be a u, and then
1343
01:15:49,730 --> 01:15:52,021
there's going to be something
orthogonal to u, and then
1344
01:15:52,021 --> 01:15:55,550
the third one, et cetera, so
that once we project on those,
1345
01:15:55,550 --> 01:15:59,600
we're keeping as much of the
covariance as possible, OK?
1346
01:15:59,600 --> 01:16:02,859
And in particular,
those directions
1347
01:16:02,859 --> 01:16:04,400
that we're going to
pick are actually
1348
01:16:04,400 --> 01:16:06,920
a subset of the vj's that
are associated to the largest
1349
01:16:06,920 --> 01:16:08,580
eigenvalues.
1350
01:16:08,580 --> 01:16:11,300
So I'm going to
stop here for today.
1351
01:16:11,300 --> 01:16:15,020
We'll finish this on Tuesday.
1352
01:16:15,020 --> 01:16:18,260
But basically, the idea is
it's just the following.
1353
01:16:18,260 --> 01:16:22,590
You're just going to--
well, let me skip one more.
1354
01:16:22,590 --> 01:16:24,812
Yeah, this is the idea.
1355
01:16:24,812 --> 01:16:27,020
You're first going to pick
the eigenvector associated
1356
01:16:27,020 --> 01:16:30,290
to the largest eigenvalue.
1357
01:16:30,290 --> 01:16:33,890
Then you're going to pick
the direction that's orthogonal
1358
01:16:33,890 --> 01:16:37,130
to the vector that
you've picked,
1359
01:16:37,130 --> 01:16:38,984
and that's carrying
the most variance.
1360
01:16:38,984 --> 01:16:40,650
And that's actually
the second largest--
1361
01:16:40,650 --> 01:16:44,030
the eigenvector associated to
the second largest eigenvalue.
1362
01:16:44,030 --> 01:16:46,520
And you're going to go all
the way to the number of them
1363
01:16:46,520 --> 01:16:50,120
that you actually want to pick,
which is in this case, d, OK?
1364
01:16:50,120 --> 01:16:53,180
And wherever you choose
to chop this process,
1365
01:16:53,180 --> 01:16:56,390
not going all the way to d,
is going to actually give you
1366
01:16:56,390 --> 01:16:57,890
a lower-dimensional
representation
1367
01:16:57,890 --> 01:17:01,238
in the coordinate system
that's given by v1, v2, v3, et
1368
01:17:01,238 --> 01:17:02,420
cetera, OK?
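That greedy procedure, top eigenvector first, then the best direction orthogonal to it, and so on, amounts to projecting the centered cloud onto the eigenvectors of the empirical covariance with the largest eigenvalues. A minimal sketch, with made-up data and an arbitrary choice of d prime = 2 (both assumptions for illustration):

```python
import numpy as np

# Made-up data: n = 100 points in d = 5 dimensions (illustrative only).
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))

Xc = X - X.mean(axis=0)             # center the point cloud
Sigma = Xc.T @ Xc / len(Xc)         # empirical covariance matrix

lambdas, P = np.linalg.eigh(Sigma)  # ascending eigenvalues
d_prime = 2
V = P[:, ::-1][:, :d_prime]         # v1, v2: eigenvectors for the 2 largest eigenvalues

Y = Xc @ V                          # coordinates in the (v1, v2) system
print(Y.shape)                      # (100, 2)
```

Chopping the process at d prime = 2 instead of going all the way to d is exactly the lower-dimensional representation described above.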
1369
01:17:02,420 --> 01:17:04,591
So we'll see that in
more detail on Tuesday.
1370
01:17:04,591 --> 01:17:06,090
But I don't want
to get into it now.
1371
01:17:06,090 --> 01:17:07,500
We don't have enough time.
1372
01:17:07,500 --> 01:17:10,000
Are there any questions?