WEBVTT

00:00:09.215 --> 00:00:11.320
PATRICK WINSTON: You know, it's
unfortunate that politics

00:00:11.320 --> 00:00:14.730
has become so serious.

00:00:14.730 --> 00:00:17.360
Back when you were little
it was a lot more fun.

00:00:17.360 --> 00:00:20.740
You could make fun
of politicians.

00:00:20.740 --> 00:00:23.815
Here's a politician some
of you may recognize.

00:00:27.480 --> 00:00:31.970
But it's convenient to be able
to vary what this particular

00:00:31.970 --> 00:00:34.210
politician looks like.

00:00:34.210 --> 00:00:41.274
For example, we can go from
a cookie baker to radical.

00:00:41.274 --> 00:00:43.960
[LAUGHTER]

00:00:43.960 --> 00:00:51.544
PATRICK WINSTON: We can go
from superwoman to bimbo.

00:00:51.544 --> 00:00:54.030
[LAUGHTER]

00:00:54.030 --> 00:00:55.920
PATRICK WINSTON: Socialite--

00:00:55.920 --> 00:00:59.710
I put socialite into this.

00:00:59.710 --> 00:01:02.340
There she is.

00:01:02.340 --> 00:01:08.430
Or we can move the slider over
the other way to bag lady.

00:01:08.430 --> 00:01:14.550
Alert, asleep, sad, happy.

00:01:14.550 --> 00:01:18.830
How does that work?

00:01:18.830 --> 00:01:19.340
I don't know.

00:01:19.340 --> 00:01:20.950
But I bet by the end
of this hour you'll

00:01:20.950 --> 00:01:22.360
know how that works.

00:01:22.360 --> 00:01:25.690
And not only that, you'll
understand something about

00:01:25.690 --> 00:01:29.940
what it takes to recognize
faces.

00:01:29.940 --> 00:01:34.380
It turns out that some theories of
face recognition are based

00:01:34.380 --> 00:01:41.791
on the same principles that
this program is based on.

00:01:41.791 --> 00:01:45.030
But you can kind of guess
what's happening here.

00:01:45.030 --> 00:01:49.500
There are many stored images and
when I move those sliders

00:01:49.500 --> 00:01:52.590
it's interpolating
amongst them.

00:01:52.590 --> 00:01:53.840
So that's how that works.
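
NOTE
A minimal sketch of "interpolating amongst stored images", assuming the
simplest reading: a pixel-wise weighted average whose weights come from the
sliders. The function name blend and the tiny 1x2 images are hypothetical;
the demo program in the lecture may well do something more elaborate.
```python
def blend(images, weights):
    """Return the pixel-wise weighted average of equally sized grayscale images."""
    if len(images) != len(weights):
        raise ValueError("need one weight per image")
    total = sum(weights)
    rows, cols = len(images[0]), len(images[0][0])
    # Dividing by the total normalises the slider weights so they sum to 1.
    return [
        [sum(w * img[r][c] for img, w in zip(images, weights)) / total
         for c in range(cols)]
        for r in range(rows)
    ]
# Hypothetical 1x2 "images": one dark, one bright.
dark, bright = [[0.0, 10.0]], [[100.0, 50.0]]
halfway = blend([dark, bright], [0.5, 0.5])
```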

00:01:56.270 --> 00:02:00.500
But the main subject of
today is this matter

00:02:00.500 --> 00:02:02.390
of recognizing objects.

00:02:02.390 --> 00:02:04.620
Faces could be the objects,
but they don't have to be.

00:02:04.620 --> 00:02:08.430
This could be an object that you
might want to recognize.

00:02:08.430 --> 00:02:11.580
And I want to talk to you a
little bit about the history

00:02:11.580 --> 00:02:13.930
of this problem and where
it stands today.

00:02:13.930 --> 00:02:15.900
It's still not solved.

00:02:15.900 --> 00:02:18.760
But it's an interesting exercise
to see how the

00:02:18.760 --> 00:02:22.360
attempts at solution
have evolved slowly

00:02:22.360 --> 00:02:23.940
over the past 30 years.

00:02:23.940 --> 00:02:28.160
So slowly, in fact, that I think
if someone told me how

00:02:28.160 --> 00:02:31.380
long it would take to get to
where we are 30 years ago I

00:02:31.380 --> 00:02:33.579
think I would have
hung myself.

00:02:33.579 --> 00:02:36.440
But things do move slowly.

00:02:36.440 --> 00:02:38.500
And it's important to see
how slowly they move.

00:02:38.500 --> 00:02:42.170
Because they will continue to
move slowly in the future.

00:02:42.170 --> 00:02:43.920
And you have to understand that
that's the way things

00:02:43.920 --> 00:02:45.990
work sometimes.

00:02:45.990 --> 00:02:49.590
So to start this all off, we
have to go back to the ideas

00:02:49.590 --> 00:02:53.060
of the legendary David Marr, who
dropped dead from leukemia

00:02:53.060 --> 00:02:56.250
in about 1980.

00:02:56.250 --> 00:03:00.700
I say, the gospel according to
Marr, because he was such a

00:03:00.700 --> 00:03:03.960
powerful and central figure that
almost anything he said

00:03:03.960 --> 00:03:09.800
was believed by a large
collection of devotees.

00:03:09.800 --> 00:03:15.240
But Marr articulated a set of
ideas about how computer

00:03:15.240 --> 00:03:19.810
vision would work that started
off by suggesting that with

00:03:19.810 --> 00:03:26.340
the input from the camera,
you look for edges.

00:03:26.340 --> 00:03:27.700
And you find edge fragments.

00:03:27.700 --> 00:03:35.720
And normally they wouldn't be
even as well-drawn as I've

00:03:35.720 --> 00:03:37.990
done them now.

00:03:37.990 --> 00:03:40.410
Or as badly drawn as
I've done them now.

00:03:40.410 --> 00:03:43.460
But the first step, then, in
visual recognition would be to

00:03:43.460 --> 00:03:46.329
form this edge-based description
of what's out

00:03:46.329 --> 00:03:47.720
there in the world.

00:03:47.720 --> 00:03:49.875
And Marr called that
the primal sketch.

00:03:57.620 --> 00:04:00.960
And from the primal sketch, the
next step was to decorate

00:04:00.960 --> 00:04:06.600
the primal sketch with some
vectors, some surface normals,

00:04:06.600 --> 00:04:12.440
showing where the faces on
the object were oriented.

00:04:12.440 --> 00:04:14.340
He called that the two
and a half D sketch.

00:04:21.360 --> 00:04:22.620
Now why is it two
and a half D?

00:04:22.620 --> 00:04:26.360
Well, it's sort of 2D in the
sense that it's still

00:04:26.360 --> 00:04:31.070
camera-centric in its way of
presenting information.

00:04:31.070 --> 00:04:33.360
But at the same time, it attempts
to say something about the

00:04:33.360 --> 00:04:37.110
three-dimensional arrangement
of the faces.

00:04:37.110 --> 00:04:39.610
So the speculation was that you
couldn't get to where you

00:04:39.610 --> 00:04:41.330
wanted to go in one step.

00:04:41.330 --> 00:04:43.970
So you needed several steps
to get from the image to

00:04:43.970 --> 00:04:45.990
something you could recognize.

00:04:45.990 --> 00:04:50.100
And the third step was to
convert the two and a half D

00:04:50.100 --> 00:04:51.850
sketch into generalized
cylinders.

00:05:03.100 --> 00:05:03.960
And the idea is this.

00:05:03.960 --> 00:05:08.140
If you have a regular cylinder,
you can think of it

00:05:08.140 --> 00:05:13.650
as a circular area moving
along an axis like so.

00:05:13.650 --> 00:05:16.570
So that's the description
of a cylinder.

00:05:16.570 --> 00:05:18.990
A circular area moving
along an axis.

00:05:18.990 --> 00:05:23.010
You can get a different kind
of cylinder if you go along

00:05:23.010 --> 00:05:26.320
the same axis but you allow
the size of the circle to

00:05:26.320 --> 00:05:27.820
change as you go.

00:05:27.820 --> 00:05:31.220
So for example, if you were
to describe a wine bottle.

00:05:35.550 --> 00:05:37.590
It would be a function of
distance along the axis that

00:05:37.590 --> 00:05:42.580
would shrink the circle
appropriately to match the

00:05:42.580 --> 00:05:45.520
dimensions of a wine bottle.
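
NOTE
A minimal sketch of the wine-bottle description: a generalized cylinder
represented as an axis plus a radius that is a function of distance along
that axis. The function names and every dimension (a 30 cm bottle, 4 cm body
radius, 1.5 cm neck radius) are made-up illustration values, not from the
lecture.
```python
import math
def bottle_radius(z):
    """Radius in cm at height z cm along the axis (illustrative numbers only)."""
    if z < 0 or z > 30:
        raise ValueError("z must lie on the axis, 0 to 30 cm")
    if z <= 18:            # straight body
        return 4.0
    if z <= 24:            # shoulder: shrink linearly from 4.0 to 1.5
        return 4.0 - (z - 18) * (4.0 - 1.5) / 6
    return 1.5             # neck
def cross_section_area(z):
    """Area of the circle swept along the axis at height z."""
    return math.pi * bottle_radius(z) ** 2
# Sample the radius function at a few heights along the axis.
profile = [(z, bottle_radius(z)) for z in (0, 18, 21, 24, 30)]
```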

00:05:45.520 --> 00:05:46.780
A fine burgundy, I perceive.

00:05:46.780 --> 00:05:50.920
In any case, this one, once
converted into a generalized

00:05:50.920 --> 00:05:54.260
cylinder, when matched against
a library of such

00:05:54.260 --> 00:05:58.875
descriptions, results
in recognition.

00:06:04.290 --> 00:06:07.190
Great theory, based on the idea
that you start off by

00:06:07.190 --> 00:06:11.330
looking at edges and you end
up, in several steps of

00:06:11.330 --> 00:06:14.470
transformation, producing
something that you could look

00:06:14.470 --> 00:06:17.350
up in a library of
descriptions.

00:06:17.350 --> 00:06:20.100
Great idea.

00:06:20.100 --> 00:06:23.840
Trouble is, no one could
make it work.

00:06:26.950 --> 00:06:28.976
It was too hard to do this.

00:06:28.976 --> 00:06:31.250
It was too hard to do that.

00:06:31.250 --> 00:06:32.980
And the generalized cylinders
produced, if

00:06:32.980 --> 00:06:35.980
any, were too coarse.

00:06:35.980 --> 00:06:37.520
You couldn't tell the difference
between a Ford and

00:06:37.520 --> 00:06:40.520
a Chevrolet or between a
Volkswagen and a Cadillac.

00:06:40.520 --> 00:06:43.010
Because they were
just too coarse.

00:06:43.010 --> 00:06:45.655
So although it was a great idea
based on the idea that

00:06:45.655 --> 00:06:49.580
you have to do recognition in
several transformations of

00:06:49.580 --> 00:06:55.430
representational apparatus,
it just didn't work.

00:06:55.430 --> 00:07:00.880
So much later, maybe 15 years
later or so, we get to the

00:07:00.880 --> 00:07:02.610
next part of our story.

00:07:02.610 --> 00:07:07.590
Which is the alignment theories,
most notably the one

00:07:07.590 --> 00:07:10.060
produced by Shimon Ullman,
one of Marr's students.

00:07:12.580 --> 00:07:16.810
So the alignment theory of
recognition is based on a very

00:07:16.810 --> 00:07:19.250
strange and exotic idea.

00:07:19.250 --> 00:07:22.930
It doesn't seem strange and
exotic to mechanical engineers

00:07:22.930 --> 00:07:25.470
for a while, because they're
used to mechanical drawings.

00:07:25.470 --> 00:07:28.620
But here's the strange
and miraculous idea.

00:07:28.620 --> 00:07:30.110
Imagine this object.

00:07:30.110 --> 00:07:33.540
You take three pictures of it.

00:07:33.540 --> 00:07:38.390
You can reconstruct any
view of that object.

00:07:38.390 --> 00:07:41.860
Now, I have to be a little bit
careful about how I say that.

00:07:41.860 --> 00:07:46.960
First of all, some of the
vertexes are not visible in

00:07:46.960 --> 00:07:48.490
the views that you have.

00:07:48.490 --> 00:07:50.880
So, of course, you can't
do anything with those.

00:07:50.880 --> 00:07:53.600
So let's say that we have a
transparent object where you

00:07:53.600 --> 00:07:55.570
can see all the vertexes.

00:07:55.570 --> 00:07:59.840
If you have three pictures of
that, you can reconstruct any

00:07:59.840 --> 00:08:01.990
view of that object.

00:08:01.990 --> 00:08:04.090
Now I have to be a little
careful about how I say that,

00:08:04.090 --> 00:08:06.430
because it's not true.

00:08:06.430 --> 00:08:09.730
What's true is, you can produce
any view of that in

00:08:09.730 --> 00:08:11.620
orthographic projection.

00:08:11.620 --> 00:08:13.670
So if you're close enough to
the object that you get

00:08:13.670 --> 00:08:15.010
perspective, it doesn't work.

00:08:15.010 --> 00:08:18.420
But for the most part, you can
neglect perspective after you

00:08:18.420 --> 00:08:20.860
get about two and a half
times as far away as

00:08:20.860 --> 00:08:22.420
the object is big.

00:08:22.420 --> 00:08:24.830
And you can presume that
you've got orthographic

00:08:24.830 --> 00:08:27.560
projection.

00:08:27.560 --> 00:08:29.500
So that's a strange
and exotic idea.

00:08:29.500 --> 00:08:31.080
But how can you make
a recognition

00:08:31.080 --> 00:08:32.150
theory out of that?

00:08:32.150 --> 00:08:33.740
So let me show you.

00:08:33.740 --> 00:08:36.395
Well, here's one drawing of the
object, I need two more.

00:08:39.230 --> 00:08:40.020
Let's see.

00:08:40.020 --> 00:08:41.270
Let's have this one.

00:08:48.440 --> 00:08:50.620
And maybe one that's tilted
up a little bit.

00:08:58.140 --> 00:09:05.360
It's important that these
pictures not be just rotations

00:09:05.360 --> 00:09:06.030
on one axis.

00:09:06.030 --> 00:09:07.860
Because they wouldn't form what
you might think of as a

00:09:07.860 --> 00:09:09.870
kind of basis set.

00:09:09.870 --> 00:09:10.850
So there are three pictures.

00:09:10.850 --> 00:09:12.110
We'll call them a, b, and c.

00:09:15.830 --> 00:09:18.860
And then we want a
fourth picture.

00:09:18.860 --> 00:09:21.172
Which will look like this.

00:09:21.172 --> 00:09:24.570
It doesn't have to
be too precise.

00:09:24.570 --> 00:09:26.890
And we'll call that
the unknown.

00:09:26.890 --> 00:09:33.220
And what we really want to know
is if the unknown is the

00:09:33.220 --> 00:09:37.100
same object that these three
pictures were made from.

00:09:41.170 --> 00:09:44.230
So let me begin with
an assertion.

00:09:44.230 --> 00:09:47.570
I'll need four colors of chalk
to make this assertion.

00:09:47.570 --> 00:09:51.310
What I want to do is I want to
pick a particular place on the

00:09:51.310 --> 00:09:52.560
object, like this one.

00:09:55.770 --> 00:09:58.220
And maybe the same place on
this object over here.

00:09:58.220 --> 00:10:00.790
Those are corresponding
places, right?

00:10:00.790 --> 00:10:05.480
So I can now write an equation
that the x-coordinate of that

00:10:05.480 --> 00:10:12.690
unknown object is equal to, oh,
I don't know, alpha x sub

00:10:12.690 --> 00:10:24.620
a plus beta x sub b plus
gamma x sub c plus

00:10:24.620 --> 00:10:27.460
some constant, tau.

00:10:27.460 --> 00:10:29.010
Well, of course, that's
obviously true.

00:10:29.010 --> 00:10:32.330
Because I'm letting you take
those alpha, beta, gamma, and

00:10:32.330 --> 00:10:33.890
tau and make them anything
you want.

00:10:36.680 --> 00:10:39.870
So although that's conspicuously
obviously true,

00:10:39.870 --> 00:10:41.630
it's not interesting.

00:10:41.630 --> 00:10:42.910
So let me take another point.

00:10:45.410 --> 00:10:48.680
And of course, I can write the
same equation down for this

00:10:48.680 --> 00:10:49.930
purple point.

00:11:01.800 --> 00:11:05.190
And now that I'm on a roll and
having a great deal of fun

00:11:05.190 --> 00:11:12.760
with this, I can
take this point

00:11:12.760 --> 00:11:14.010
and make a blue equation.

00:11:26.110 --> 00:11:29.840
And you know I'm destined
to do it, so I've

00:11:29.840 --> 00:11:31.050
got one more color.

00:11:31.050 --> 00:11:32.872
I might as well use it.

00:11:32.872 --> 00:11:36.350
Let's just make sure I get
something that works here.

00:11:36.350 --> 00:11:38.810
That's this one, that's
this one.

00:11:38.810 --> 00:11:42.180
I hope I've got these
correspondences right.

00:11:42.180 --> 00:11:42.700
STUDENT: [INAUDIBLE].

00:11:42.700 --> 00:11:44.030
PATRICK WINSTON: Have
I got one off?

00:11:44.030 --> 00:11:45.190
STUDENT: [INAUDIBLE].

00:11:45.190 --> 00:11:45.865
PATRICK WINSTON: Which color?

00:11:45.865 --> 00:11:46.210
STUDENT: Blue.

00:11:46.210 --> 00:11:47.570
[INAUDIBLE].

00:11:47.570 --> 00:11:47.930
PATRICK WINSTON: OK.

00:11:47.930 --> 00:11:51.500
So this one goes with this
one, goes with this one.

00:11:51.500 --> 00:11:52.200
Is that one wrong?

00:11:52.200 --> 00:11:54.920
STUDENTS: Yeah.

00:11:54.920 --> 00:11:56.100
PATRICK WINSTON: Oh, oh, oh.

00:11:56.100 --> 00:12:00.442
Of course this one, excuse
me, goes down here.

00:12:00.442 --> 00:12:02.380
Right?

00:12:02.380 --> 00:12:05.520
And then this one
is off as well.

00:12:05.520 --> 00:12:07.870
I wouldn't get a very good
recognition scheme if I can't

00:12:07.870 --> 00:12:10.630
get those correspondences
right.

00:12:10.630 --> 00:12:16.108
Which is one of the
lessons of today.

00:12:16.108 --> 00:12:16.550
OK.

00:12:16.550 --> 00:12:17.820
Now I've got them right.

00:12:17.820 --> 00:12:19.700
And now that equation
is correct.

00:12:19.700 --> 00:12:22.970
I think I've got this
one right already.

00:12:22.970 --> 00:12:24.730
So now I can just
write that down.

00:12:24.730 --> 00:12:26.910
I'm on a roll, I'm just
copying this.

00:12:37.870 --> 00:12:41.460
So those are a bunch
of equations.

00:12:41.460 --> 00:12:49.530
And now the astonishing part
is that I can choose alpha,

00:12:49.530 --> 00:12:56.070
beta, gamma, and tau
to be all the same.

00:12:59.710 --> 00:13:03.200
That is, there's one set of
alpha, beta, gamma, and tau

00:13:03.200 --> 00:13:07.220
that works for everything,
for all four points.
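
NOTE
The assertion written out, as a sketch. The superscript (i) stands in for the
four chalk colors on the board and is a notational assumption; x_a, x_b, x_c
are the image x-coordinates of the same point in the three stored pictures,
and x_u is its x-coordinate in the unknown picture.
```latex
\begin{aligned}
x_u^{(i)} &= \alpha\, x_a^{(i)} + \beta\, x_b^{(i)} + \gamma\, x_c^{(i)} + \tau,
\qquad i = 1, 2, 3, 4
\end{aligned}
```
The claim is that a single choice of alpha, beta, gamma, and tau satisfies
all four equations at once.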

00:13:07.220 --> 00:13:09.330
So you look at that puzzled.

00:13:09.330 --> 00:13:10.450
And that's OK to be puzzled.

00:13:10.450 --> 00:13:11.890
Because I certainly
haven't proved it.

00:13:11.890 --> 00:13:14.430
I'm asserting it.

00:13:14.430 --> 00:13:15.890
But right away, there's
something interesting about

00:13:15.890 --> 00:13:18.330
this and that is that the
relationship between the

00:13:18.330 --> 00:13:22.020
points on the unknown object and
the points in this stored

00:13:22.020 --> 00:13:28.700
library of images are
related linearly.

00:13:28.700 --> 00:13:31.570
That's true because it's
orthographic projection.

00:13:31.570 --> 00:13:32.880
Linearly related.

00:13:32.880 --> 00:13:38.610
So I can generate the points in
some fourth object from the

00:13:38.610 --> 00:13:43.740
points in three sample objects
with linear operations.

00:13:43.740 --> 00:13:43.990
Christopher?

00:13:43.990 --> 00:13:46.950
STUDENT: Is that the
x-coordinate of--

00:13:46.950 --> 00:13:48.200
PATRICK WINSTON: It's
the x-coordinate.

00:13:50.840 --> 00:13:52.380
Christopher asked about
the x-coordinates.

00:13:52.380 --> 00:13:55.850
Each of these x-coordinates is
meant to be color coded.

00:13:55.850 --> 00:14:00.130
It gets a little complicated
with notation and stuff.

00:14:00.130 --> 00:14:03.720
So that's the reason I'm color
coding the coordinates.

00:14:03.720 --> 00:14:09.710
So the orange x sub u is the
x-coordinate of that

00:14:09.710 --> 00:14:10.570
particular point.

00:14:10.570 --> 00:14:12.390
STUDENT: In 3D space?

00:14:12.390 --> 00:14:13.610
PATRICK WINSTON: No.

00:14:13.610 --> 00:14:14.480
Not in 3D space.

00:14:14.480 --> 00:14:15.536
In the image.

00:14:15.536 --> 00:14:17.720
STUDENT: So it's a 2D
projection of it?

00:14:17.720 --> 00:14:19.750
PATRICK WINSTON: It's a 2D
projection of it, an

00:14:19.750 --> 00:14:20.600
orthographic projection.

00:14:20.600 --> 00:14:21.450
OK?

00:14:21.450 --> 00:14:24.010
So we're looking at drawings.

00:14:24.010 --> 00:14:26.500
And those coordinates over there
are the two-dimensional

00:14:26.500 --> 00:14:29.470
coordinates in the drawing.

00:14:29.470 --> 00:14:32.615
Just as if it were
on your retina.

00:14:32.615 --> 00:14:34.555
STUDENT: [INAUDIBLE]

00:14:34.555 --> 00:14:39.410
vertexes on the 3D projection
or can curved surfaces also?

00:14:39.410 --> 00:14:41.470
PATRICK WINSTON: So he asked
about curved surfaces.

00:14:41.470 --> 00:14:43.810
And the answer is that you have
to find corresponding

00:14:43.810 --> 00:14:45.610
points on the object.

00:14:45.610 --> 00:14:50.070
So if you have a totally curved
surface and you can't

00:14:50.070 --> 00:14:52.940
identify any corresponding
points, you lose.

00:14:52.940 --> 00:14:56.230
But if you consider our faces,
there are some obvious points,

00:14:56.230 --> 00:14:58.700
even though our faces are
not by any means

00:14:58.700 --> 00:15:00.970
flat like these objects.

00:15:00.970 --> 00:15:03.260
We have the tip of our nose
and the center of our

00:15:03.260 --> 00:15:06.290
eyeballs and so on.

00:15:06.290 --> 00:15:11.250
So if that's true, what does
that mean about recovering

00:15:11.250 --> 00:15:15.390
alpha, beta, gamma, and tau?

00:15:15.390 --> 00:15:18.590
Can we find them?

00:15:18.590 --> 00:15:19.890
[INAUDIBLE], what
do you think?

00:15:19.890 --> 00:15:21.300
How do we go about
finding them?

00:15:21.300 --> 00:15:22.920
You're nodding your head
in the right direction.

00:15:22.920 --> 00:15:23.880
[LAUGHTER]

00:15:23.880 --> 00:15:26.280
STUDENT: It's four
equations and--

00:15:26.280 --> 00:15:26.550
PATRICK WINSTON: Splendid.

00:15:26.550 --> 00:15:28.712
It's four equations
and four unknowns.

00:15:28.712 --> 00:15:30.400
Four linear equations
and four unknowns.

00:15:30.400 --> 00:15:33.560
So obviously, you can solve for
alpha, beta, gamma, and

00:15:33.560 --> 00:15:37.960
tau if you know that these
equations are correct.
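
NOTE
A minimal sketch of solving the four linear equations for alpha, beta, gamma,
and tau, using plain Gaussian elimination so that nothing beyond the Python
standard library is needed. The function names and all coordinates are made
up; the unknown-image coordinates are generated from known coefficients so
the recovered values can be checked.
```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("points do not determine the coefficients")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x
def fit_coefficients(xa, xb, xc, xu):
    """Each argument holds the x-coordinates of the same four points in one image."""
    rows = [[a, b, c, 1.0] for a, b, c in zip(xa, xb, xc)]
    return solve_linear(rows, xu)
# Made-up coordinates of four corresponding points in images a, b, and c.
xa, xb, xc = [1.0, 2.0, 0.0, 3.0], [0.0, 1.0, 2.0, 1.0], [2.0, 0.0, 1.0, 1.0]
xu = [0.5 * a - 1.0 * b + 2.0 * c + 3.0 for a, b, c in zip(xa, xb, xc)]
alpha, beta, gamma, tau = fit_coefficients(xa, xb, xc, xu)
```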

00:15:37.960 --> 00:15:41.390
So how does that help
us with recognition?

00:15:41.390 --> 00:15:44.000
It helps us with recognition
because we can take another

00:15:44.000 --> 00:15:51.820
point, let me say this square
point here and this

00:15:51.820 --> 00:15:58.410
corresponding square point here
and this corresponding

00:15:58.410 --> 00:16:00.750
square point here, and what
can we do with those three

00:16:00.750 --> 00:16:02.460
points now?

00:16:02.460 --> 00:16:05.610
We've got alpha, beta, gamma,
and tau, so we can predict

00:16:05.610 --> 00:16:09.060
where it's going to be
in the fourth image.

00:16:09.060 --> 00:16:15.310
So we can predict that that
square point is going to be

00:16:15.310 --> 00:16:17.060
right there.

00:16:17.060 --> 00:16:21.030
And if it isn't, we're highly
suspicious about whether this

00:16:21.030 --> 00:16:25.230
object is the kind of object
we think it is.
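
NOTE
A minimal sketch of the recognition test: once the coefficients are known,
predict where a fifth point should fall in the unknown image and compare with
where it actually is. The function names, the coordinates, and the tolerance
of 0.05 are assumptions for illustration; the lecture does not say how close
counts as a match.
```python
def predict_x(coeffs, xa, xb, xc):
    """Predicted x-coordinate in the unknown image for one extra point."""
    alpha, beta, gamma, tau = coeffs
    return alpha * xa + beta * xb + gamma * xc + tau
def consistent(coeffs, model_point, observed_x, tolerance=0.05):
    """True if the observed point lands close to where the models say it should."""
    return abs(predict_x(coeffs, *model_point) - observed_x) <= tolerance
coeffs = (0.5, -1.0, 2.0, 3.0)     # assumed already recovered from four points
square = (2.0, 2.0, 1.0)           # the fifth point in images a, b, c (made up)
same_object = consistent(coeffs, square, 4.0)    # observed where predicted
other_object = consistent(coeffs, square, 5.5)   # observed somewhere else
```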

00:16:25.230 --> 00:16:26.480
So you look at me
in disbelief.

00:16:26.480 --> 00:16:29.036
You'd like me to demonstrate
this, I imagine.

00:16:29.036 --> 00:16:29.472
STUDENT: Yeah.

00:16:29.472 --> 00:16:31.220
PATRICK WINSTON: OK.

00:16:31.220 --> 00:16:32.500
Let me see if I can
demonstrate this.

00:16:51.480 --> 00:16:57.110
So I'm going to do this in a
slightly simplified version.

00:16:57.110 --> 00:17:01.320
I'm only going to allow
rotation around

00:17:01.320 --> 00:17:03.270
the vertical axis.

00:17:03.270 --> 00:17:05.680
And just so you know I'm not
cheating, there's a little

00:17:05.680 --> 00:17:09.358
slider here that rotates
that third object.

00:17:09.358 --> 00:17:12.770
Let's see, why are there
just two known

00:17:12.770 --> 00:17:13.940
objects and one unknown?

00:17:13.940 --> 00:17:16.740
Well that's because I've
restricted the motion to

00:17:16.740 --> 00:17:20.848
rotation around the vertical
axis and some translation.

00:17:20.848 --> 00:17:24.630
So now that I've spun that
around a little bit, let me

00:17:24.630 --> 00:17:27.300
pick some corresponding
points.

00:17:27.300 --> 00:17:28.508
Oops.

00:17:28.508 --> 00:17:29.758
What's happened?

00:17:41.240 --> 00:17:41.520
Wow.

00:17:41.520 --> 00:17:42.840
Let me run that by again.

00:18:04.430 --> 00:18:04.780
OK.

00:18:04.780 --> 00:18:07.060
So there's one point
I've selected

00:18:07.060 --> 00:18:08.970
from the model objects.

00:18:08.970 --> 00:18:10.830
The corresponding point
over here on the

00:18:10.830 --> 00:18:12.400
unknown is right there.

00:18:12.400 --> 00:18:13.350
I'm going to be a little off.

00:18:13.350 --> 00:18:15.120
But that's OK.

00:18:15.120 --> 00:18:18.480
So let me just pick that
one and then that

00:18:18.480 --> 00:18:20.650
corresponds to this one.

00:18:20.650 --> 00:18:23.870
Krishna, would you like to
specify a point so people know

00:18:23.870 --> 00:18:25.680
I'm not cheating.

00:18:25.680 --> 00:18:27.900
Pick a point.

00:18:27.900 --> 00:18:29.050
Pick a point, Krishna.

00:18:29.050 --> 00:18:30.874
STUDENT: Oh, the right?

00:18:30.874 --> 00:18:32.020
PATRICK WINSTON: The right?

00:18:32.020 --> 00:18:32.700
STUDENT: Yeah.

00:18:32.700 --> 00:18:33.190
PATRICK WINSTON: This one?

00:18:33.190 --> 00:18:35.046
STUDENT: Yep.

00:18:35.046 --> 00:18:35.510
PATRICK WINSTON: Oops.

00:18:35.510 --> 00:18:37.310
OK, let's pick it out
on the model first.

00:18:37.310 --> 00:18:40.175
Now pick it over here.

00:18:40.175 --> 00:18:40.670
Boom.

00:18:40.670 --> 00:18:43.290
So all the points are where
they're supposed to be.

00:18:43.290 --> 00:18:45.690
Isn't that cool?

00:18:45.690 --> 00:18:47.640
Well, let's suppose that the
unknown is something else.

00:18:50.540 --> 00:18:52.710
This is a carefully
selected object.

00:18:52.710 --> 00:18:57.320
Because the points are all in the
correct positions vertically,

00:18:57.320 --> 00:18:59.300
but they're not necessarily in the
correct positions in the

00:18:59.300 --> 00:19:00.950
other two dimensions.

00:19:00.950 --> 00:19:08.830
So let me pick this point, and
this point, and this point,

00:19:08.830 --> 00:19:11.160
and this point.

00:19:11.160 --> 00:19:14.630
And Krishna had me
pick this point.

00:19:14.630 --> 00:19:17.220
So let me pick this point.

00:19:17.220 --> 00:19:21.530
So if it thinks that the
unknown is one of these

00:19:21.530 --> 00:19:25.420
obelisk objects, then we would
expect to see all of the

00:19:25.420 --> 00:19:28.280
corresponding points correctly
identified.

00:19:28.280 --> 00:19:29.780
But boom.

00:19:29.780 --> 00:19:31.030
All the points are off.

00:19:34.850 --> 00:19:37.360
So it seems to work in this
particular example.

00:19:37.360 --> 00:19:43.640
I find the alpha and beta
using two images.

00:19:43.640 --> 00:19:47.010
And I predict the locations
of the other points.

00:19:47.010 --> 00:19:49.630
And I determine whether those
positions are correct.

00:19:49.630 --> 00:19:51.780
And if they are correct, then I
have a pretty good idea that

00:19:51.780 --> 00:19:54.840
I have in fact identified the
object on the right as either

00:19:54.840 --> 00:19:59.540
an obelisk or an organ,
depending on which of the

00:19:59.540 --> 00:20:05.420
model choices and the unknown
choices I've selected.

00:20:05.420 --> 00:20:09.550
So the only thing I have left
to do is to demonstrate that

00:20:09.550 --> 00:20:12.160
what I said about
this is true.

00:20:12.160 --> 00:20:14.880
So I'm going to actually
demonstrate that what I said

00:20:14.880 --> 00:20:18.710
about this is true using the
configuration in this

00:20:18.710 --> 00:20:19.640
demonstration.

00:20:19.640 --> 00:20:23.120
Because it's much too hard
for me to remember matrix

00:20:23.120 --> 00:20:25.930
transformations for generalized
rotation in three

00:20:25.930 --> 00:20:27.600
dimensions.

00:20:27.600 --> 00:20:28.850
So here's how it's
going to work.

00:20:33.410 --> 00:20:37.140
The z-axis is going
up that way.

00:20:37.140 --> 00:20:41.640
Or, it's going to be pointing
toward you.

00:20:41.640 --> 00:20:43.760
And what I'm going to
do is I'm going to

00:20:43.760 --> 00:20:46.820
rotate around this axis.

00:20:46.820 --> 00:20:49.750
And what I want to do is I
want to find out how the

00:20:49.750 --> 00:20:52.670
x-coordinate in the image
of the points move

00:20:52.670 --> 00:20:53.920
as I do that rotation.

00:20:56.350 --> 00:21:00.300
So here's the x-axis.

00:21:00.300 --> 00:21:03.180
This is the coordinate
that you can see.

00:21:03.180 --> 00:21:05.520
Here is the y-axis.

00:21:05.520 --> 00:21:08.010
That's in depth, so you can't
tell how far away it is.

00:21:10.750 --> 00:21:12.180
And the z-axis--

00:21:12.180 --> 00:21:17.060
x, y, z-axis must be pointing
out that way toward you.

00:21:17.060 --> 00:21:21.660
So now I'm going to consider
just a single point on the

00:21:21.660 --> 00:21:24.310
object and see what
happens to it.

00:21:24.310 --> 00:21:31.640
So I'm going to say to myself,
let's put the object in some

00:21:31.640 --> 00:21:32.650
kind of standard position.

00:21:32.650 --> 00:21:34.300
I don't care what it is.

00:21:34.300 --> 00:21:36.450
It can be just random,
just spin it around.

00:21:36.450 --> 00:21:42.810
Some position, we'll call that
the standard position, S. And

00:21:42.810 --> 00:21:46.770
that means that the x-coordinate
of the standard

00:21:46.770 --> 00:21:49.890
position is x sub s.

00:21:49.890 --> 00:21:57.520
And the y-coordinate of the
standard position is y sub s.

00:21:57.520 --> 00:22:01.930
And now I'm going to rotate
the object three times.

00:22:01.930 --> 00:22:05.280
Once to form the a picture, once
to form the b picture,

00:22:05.280 --> 00:22:07.110
and once to form
the c picture.

00:22:07.110 --> 00:22:09.960
And you can make
those choices.

00:22:09.960 --> 00:22:12.330
Those can be anything, right?

00:22:12.330 --> 00:22:18.540
So let's say that the a
picture is out here.

00:22:18.540 --> 00:22:22.190
So that's the a picture.

00:22:22.190 --> 00:22:25.040
The b picture is out here.

00:22:25.040 --> 00:22:27.610
And the unknown is
up that way.

00:22:32.000 --> 00:22:37.540
And so what I want to know
depends on these vectors.

00:22:37.540 --> 00:22:41.220
We'll call that theta sub a,
and this is theta sub b.

00:22:45.230 --> 00:22:50.430
And this one is theta sub u.

00:22:50.430 --> 00:23:01.950
So I would like to know how
x sub a depends on x

00:23:01.950 --> 00:23:05.490
sub s and y sub s.

00:23:05.490 --> 00:23:08.490
And I can never remember how
to do that, because I can

00:23:08.490 --> 00:23:09.810
never remember the
transformation

00:23:09.810 --> 00:23:11.480
equations for rotation.

00:23:11.480 --> 00:23:14.880
So I have to figure
it out every time.

00:23:14.880 --> 00:23:16.090
And this is no exception.

00:23:16.090 --> 00:23:18.870
So what I'm going to say is that
this vector that goes out

00:23:18.870 --> 00:23:21.900
to S consists of two pieces.

00:23:21.900 --> 00:23:25.570
There's the x part
and the y part.

00:23:25.570 --> 00:23:30.390
And I know that I can rotate
this vector by theta sub a by

00:23:30.390 --> 00:23:33.130
rotating this vector and
rotating that vector and

00:23:33.130 --> 00:23:35.370
adding up the results.

00:23:35.370 --> 00:23:39.930
So if I rotate this vector
by theta sub a, then the

00:23:39.930 --> 00:23:46.200
contribution of that to the
x-coordinate of a is going to

00:23:46.200 --> 00:23:52.790
be given by the cosine
of theta sub a

00:23:52.790 --> 00:23:56.530
multiplied by x sub s.

00:23:56.530 --> 00:23:59.360
So you can just exaggerate that
motion, say, well if I

00:23:59.360 --> 00:24:03.220
pitch it up that way then the
projection down on the x-axis

00:24:03.220 --> 00:24:06.520
is going to be this length
of the vector times the

00:24:06.520 --> 00:24:07.770
cosine of the angle.

00:24:10.410 --> 00:24:15.930
Now there's also going to be
a dependence on y sub s.

00:24:15.930 --> 00:24:17.820
Let's figure out what
that's going to be.

00:24:17.820 --> 00:24:19.065
I've got this vector here.

00:24:19.065 --> 00:24:22.220
And I'm going to rotate it
by theta sub a as well.

00:24:22.220 --> 00:24:25.250
If I rotate that by theta sub a
and see what the projection

00:24:25.250 --> 00:24:28.020
is on the x-axis, that's
going to be given by

00:24:28.020 --> 00:24:30.570
the sine of the angle.

00:24:30.570 --> 00:24:34.080
But it's going the wrong way, so
I have to subtract it off.

00:24:34.080 --> 00:24:36.825
So that's how I don't have to
remember what the signs are on

00:24:36.825 --> 00:24:38.075
these equations.
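
NOTE
A minimal sketch of the rotation step just described: the image x-coordinate
of a point after rotating its standard position (xs, ys) by an angle theta
about the vertical axis. The function name image_x is hypothetical, and the
angle is taken in radians.
```python
import math
def image_x(xs, ys, theta):
    """x-coordinate seen in the image after rotating (xs, ys) by theta radians."""
    # The x part contributes xs*cos(theta); the y part swings the other way,
    # so its contribution ys*sin(theta) is subtracted.
    return xs * math.cos(theta) - ys * math.sin(theta)
# Rotating a point on the depth axis by 90 degrees moves it to negative x.
quarter_turn = image_x(0.0, 1.0, math.pi / 2)
```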

00:24:44.520 --> 00:24:45.450
Well, that was good.

00:24:45.450 --> 00:24:47.940
And now that I'm off
and running I can

00:24:47.940 --> 00:24:48.710
do what I did before.

00:24:48.710 --> 00:24:50.280
It makes it easy to
give the lecture.

00:24:50.280 --> 00:24:54.120
Because this is going to be x
sub b is equal to x sub s

00:24:54.120 --> 00:25:00.690
times the cosine of theta sub
b minus y sub s times the

00:25:00.690 --> 00:25:03.240
cosine of theta--

00:25:03.240 --> 00:25:05.690
oh, you're letting
me make mistakes.

00:25:05.690 --> 00:25:06.940
Shame.

00:25:09.740 --> 00:25:12.050
I can generally tell by all
the troubled looks.

00:25:12.050 --> 00:25:14.280
But there should be some
shouting as well.

00:25:14.280 --> 00:25:18.150
That's the sine and
that's the sine.

00:25:18.150 --> 00:25:19.780
And one more time.

00:25:19.780 --> 00:25:26.890
x sub u is equal to x sub s
times the cosine of theta sub

00:25:26.890 --> 00:25:33.830
u minus y sub s times the
sine of theta sub u.
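
NOTE
The three equations as spoken, collected in one place with the sine
correction applied. Subscript s marks the standard position; a, b, and u mark
the two model pictures and the unknown.
```latex
\begin{aligned}
x_a &= x_s \cos\theta_a - y_s \sin\theta_a \\
x_b &= x_s \cos\theta_b - y_s \sin\theta_b \\
x_u &= x_s \cos\theta_u - y_s \sin\theta_u
\end{aligned}
```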

00:25:33.830 --> 00:25:36.670
And I forgot the b up there.

00:25:36.670 --> 00:25:37.880
So there are some equations.

00:25:37.880 --> 00:25:39.710
And we don't know what
we're doing.

00:25:39.710 --> 00:25:41.610
We're just going to stare at
them awhile and see if they

00:25:41.610 --> 00:25:42.860
sing us a song.

00:25:45.200 --> 00:25:48.480
So let's see if they
sing us a song.

00:25:48.480 --> 00:25:54.350
What about x sub
a and x sub b?

00:25:54.350 --> 00:25:57.160
These are things that
we see in the image.

00:25:57.160 --> 00:25:58.520
These are things that
we can measure.

00:26:10.850 --> 00:26:14.100
What about all those cosines
and sines of theta

00:26:14.100 --> 00:26:16.000
a's and theta b's.

00:26:16.000 --> 00:26:18.010
Well, we have no idea
what they are.

00:26:18.010 --> 00:26:20.420
But one thing is clear.

00:26:20.420 --> 00:26:25.260
They're true for all of the
points on the object.

00:26:25.260 --> 00:26:28.740
Because when we rotate the
object around by angle theta,

00:26:28.740 --> 00:26:31.510
we're rotating all of
the points through

00:26:31.510 --> 00:26:33.790
the same angle, right?

00:26:33.790 --> 00:26:35.250
So with respect to any

00:26:35.250 --> 00:26:38.790
particular view of the object--

00:26:38.790 --> 00:26:41.810
here we are in the standard
position.

00:26:41.810 --> 00:26:44.590
Here we are in position a.

00:26:44.590 --> 00:26:46.830
The vectors to all of the
points on the object are

00:26:46.830 --> 00:26:50.510
rotated by the same angle when
we go from the standard

00:26:50.510 --> 00:26:53.240
position to the a position.

00:26:53.240 --> 00:27:02.100
So that means that for all of
the images in this particular

00:27:02.100 --> 00:27:06.940
rendering, with a particular
rotation by theta a, theta b,

00:27:06.940 --> 00:27:09.465
and theta u, those
are constants.

00:27:15.820 --> 00:27:18.125
Now remember this is for
a particular theta a, a

00:27:18.125 --> 00:27:21.140
particular theta b, and
a particular theta u.

00:27:21.140 --> 00:27:23.850
As long as we're talking about
all of the points for each of

00:27:23.850 --> 00:27:28.920
those rotations, those angles
and cosines are going to be

00:27:28.920 --> 00:27:35.090
the same for all possible
points on the object.

00:27:37.780 --> 00:27:38.110
OK.

00:27:38.110 --> 00:27:41.790
So now we go back to our high
school algebra experts and we

00:27:41.790 --> 00:27:49.880
say, look at these first two
equations. We've got two

00:27:49.880 --> 00:27:55.540
equations and what we can now
construe to be two unknowns.

00:27:55.540 --> 00:27:57.210
What are the unknowns
that are left?

00:27:57.210 --> 00:27:58.990
We can measure a and b.

00:27:58.990 --> 00:28:00.700
Whatever the cosines
are, they're the

00:28:00.700 --> 00:28:02.660
same for all the pictures.

00:28:02.660 --> 00:28:05.580
So if we treat those as
constants, then we can solve

00:28:05.580 --> 00:28:08.770
for x sub s and y sub s.

00:28:08.770 --> 00:28:10.490
Right?

00:28:10.490 --> 00:28:14.860
We can solve for x sub s and y
sub s in terms of x sub a and

00:28:14.860 --> 00:28:20.190
x sub b and a whole bunch
of constants.

00:28:20.190 --> 00:28:27.220
But, I don't know, a whole bunch
of constants, let's see.

00:28:27.220 --> 00:28:30.640
We can gather up all of those
cosines and ratios of sines

00:28:30.640 --> 00:28:34.350
and cosines and all that stuff
and put them all together.

00:28:34.350 --> 00:28:36.130
Because they're all constants.

00:28:36.130 --> 00:28:38.320
And then we can do this.

00:28:38.320 --> 00:28:48.060
We can say x sub
u is equal to--

00:28:48.060 --> 00:28:54.010
well, it's going to depend
on x sub a and x sub b.

00:28:54.010 --> 00:28:58.030
And by the time we wash or
manipulate or screw around

00:28:58.030 --> 00:29:03.220
with all those cosines, we can
say that the multiplier for x

00:29:03.220 --> 00:29:07.670
sub a is some constant alpha and
the multiplier for x sub b

00:29:07.670 --> 00:29:09.910
is some constant beta.

00:29:09.910 --> 00:29:11.500
So that's not a sleight
of hand.

00:29:11.500 --> 00:29:12.710
That's just linear

00:29:12.710 --> 00:29:15.300
manipulation of those equations.

00:29:15.300 --> 00:29:17.940
And that's what we wanted to
show, that for orthographic

00:29:17.940 --> 00:29:21.130
projection, which this is--
there is no perspective

00:29:21.130 --> 00:29:23.530
involved here, we're just taking
the projection along

00:29:23.530 --> 00:29:24.780
the x-axis--

00:29:26.480 --> 00:29:30.060
we can demonstrate for this
simplified situation that that

00:29:30.060 --> 00:29:31.310
equation must hold.
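
NOTE
A minimal sketch that checks numerically that x_u = alpha * x_a + beta * x_b
holds for every point with the same alpha and beta. It assumes pure rotation
about one axis and orthographic projection onto the x-axis, as in the equations
above. The function names, angles, and points are hypothetical illustration
values; the closed forms for alpha and beta come from solving the first two
equations for x_s and y_s and substituting into the third.
```python
import math
# x-coordinate of the standard point (xs, ys) after rotating it by theta.
def project_x(xs, ys, theta):
    return xs * math.cos(theta) - ys * math.sin(theta)
# Constants alpha, beta such that x_u = alpha * x_a + beta * x_b for all points.
def combination_constants(theta_a, theta_b, theta_u):
    det = math.sin(theta_b - theta_a)  # zero if views a and b give the same equation
    alpha = math.sin(theta_b - theta_u) / det
    beta = math.sin(theta_u - theta_a) / det
    return alpha, beta
theta_a, theta_b, theta_u = 0.3, 1.1, 0.7  # arbitrary angles in radians (assumed)
alpha, beta = combination_constants(theta_a, theta_b, theta_u)
points = [(1.0, 2.0), (-3.0, 0.5), (4.0, -1.5)]  # made-up standard-position points
for xs, ys in points:
    x_a = project_x(xs, ys, theta_a)
    x_b = project_x(xs, ys, theta_b)
    x_u = project_x(xs, ys, theta_u)
    # The same two constants predict the unknown view for every point.
    assert abs(x_u - (alpha * x_a + beta * x_b)) < 1e-9
```
Because alpha and beta do not depend on the point, they could be estimated from
a couple of corresponding points and then reused to predict the others.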

00:29:33.880 --> 00:29:35.310
Now I want to give you
a few puzzles.

00:29:35.310 --> 00:29:36.730
Because this stuff
is so simple.

00:29:36.730 --> 00:29:41.020
Suppose I allow translation
as well as rotation.

00:29:41.020 --> 00:29:42.696
What's going to happen?

00:29:42.696 --> 00:29:44.094
STUDENT: You just get the tau.

00:29:44.094 --> 00:29:44.560
Basically, you get a constant.

00:29:44.560 --> 00:29:46.180
PATRICK WINSTON: Yeah, you
add a constant, tau.

00:29:46.180 --> 00:29:47.760
But what do we need to do
in order to solve it?

00:29:47.760 --> 00:29:49.221
STUDENT: Subtract them
[INAUDIBLE].

00:29:49.221 --> 00:29:52.630
You subtract two equations
and then [INAUDIBLE].

00:29:52.630 --> 00:29:54.950
PATRICK WINSTON: Let's
see, now we've got

00:29:54.950 --> 00:29:56.206
three unknowns, right?

00:29:56.206 --> 00:29:56.985
I don't know tau.

00:29:56.985 --> 00:29:58.216
I don't know x sub s.

00:29:58.216 --> 00:30:00.910
And I don't know y sub s.

00:30:00.910 --> 00:30:02.650
So I need another equation.

00:30:02.650 --> 00:30:04.534
Where do I get the
other equation?

00:30:04.534 --> 00:30:05.430
STUDENT: [INAUDIBLE].

00:30:05.430 --> 00:30:06.680
PATRICK WINSTON: From
another picture.

00:30:09.910 --> 00:30:14.150
That's why up there I
needed four points.

00:30:14.150 --> 00:30:17.690
That covers a situation where
I've got three degrees of

00:30:17.690 --> 00:30:20.080
rotation and translation.

00:30:20.080 --> 00:30:25.360
Here I got by with just two
pictures in this illustration.

00:30:25.360 --> 00:30:27.840
That one involved a tau
translational element, so I

00:30:27.840 --> 00:30:28.720
needed three pictures.

00:30:28.720 --> 00:30:32.700
And this one's got full
rotation, so I needed four.
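
NOTE
A minimal sketch of the effect of the added constant, assuming each view adds
its own fixed shift to every x-coordinate, so that
x_u = alpha * x_a + beta * x_b + tau. Under that assumption, three corresponding
points are enough to fit alpha, beta, and tau, and a further point can then be
predicted. The helper names, angles, shifts, and points are hypothetical, and
Cramer's rule is used only to keep the example dependency-free.
```python
import math
# Image x-coordinate of standard point (xs, ys) after rotation by theta plus a shift.
def project_x(xs, ys, theta, shift):
    return xs * math.cos(theta) - ys * math.sin(theta) + shift
# Determinant of a 3x3 matrix given as a list of rows.
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
# Fit (alpha, beta, tau) from three corresponding points using Cramer's rule.
def fit_constants(xa, xb, xu):
    rows = [[xa[i], xb[i], 1.0] for i in range(3)]
    d = det3(rows)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in rows]
        for i in range(3):
            replaced[i][col] = xu[i]
        solution.append(det3(replaced) / d)
    return tuple(solution)
views = {"a": (0.3, 1.5), "b": (1.1, -0.7), "u": (0.7, 2.0)}  # (angle, shift), made up
points = [(1.0, 2.0), (-3.0, 0.5), (4.0, -1.5), (0.5, 0.5)]   # made-up object points
xa = [project_x(x, y, *views["a"]) for x, y in points]
xb = [project_x(x, y, *views["b"]) for x, y in points]
xu = [project_x(x, y, *views["u"]) for x, y in points]
alpha, beta, tau = fit_constants(xa[:3], xb[:3], xu[:3])
# The constants fitted on three points predict the fourth point in the unknown view.
predicted = alpha * xa[3] + beta * xb[3] + tau
assert abs(predicted - xu[3]) < 1e-9
```
The three points used for the fit must not lie on one line in the standard
position, otherwise the determinant is zero and the constants are not determined.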

00:30:32.700 --> 00:30:40.226
So great idea, works fine.

00:30:40.226 --> 00:30:45.410
The trouble is it doesn't work
so fine on natural objects.

00:30:45.410 --> 00:30:48.630
It works fine on things that are
manufactured because they

00:30:48.630 --> 00:30:51.250
all have identical dimensions.

00:30:51.250 --> 00:30:55.420
So if I made a million of these
in a factory, I'd have

00:30:55.420 --> 00:30:56.950
no trouble recognizing them.

00:30:56.950 --> 00:31:02.090
Because all I'd have to do is
take three pictures, record

00:31:02.090 --> 00:31:04.840
the coordinates of some of the
points, and I'd be done.

00:31:04.840 --> 00:31:07.245
The trouble is the natural
world isn't like this.

00:31:10.410 --> 00:31:13.080
And you aren't like
this either.

00:31:16.100 --> 00:31:21.020
I don't know, if I'm trying to
recognize faces, it's not that

00:31:21.020 --> 00:31:23.700
easy to do all this.

00:31:23.700 --> 00:31:27.380
First of all, it's a little
difficult to find the exact

00:31:27.380 --> 00:31:30.390
point, the exactly corresponding
points.

00:31:30.390 --> 00:31:32.950
I made a mistake in
doing it myself.

00:31:32.950 --> 00:31:35.230
And if the computer made a
mistake it would certainly

00:31:35.230 --> 00:31:36.060
make an error.

00:31:36.060 --> 00:31:39.440
Because it would be using
non-corresponding points to

00:31:39.440 --> 00:31:40.120
make the prediction.

00:31:40.120 --> 00:31:42.656
So it would be way off.

00:31:42.656 --> 00:31:47.070
But this is still in the
tradition of working from

00:31:47.070 --> 00:31:51.950
local features in the objects
toward recognition.

00:31:51.950 --> 00:31:58.790
So having looked at that theory,
we also find it a

00:31:58.790 --> 00:31:59.350
little wanting.

00:31:59.350 --> 00:32:02.280
It works great in some
circumstances, doesn't seem to

00:32:02.280 --> 00:32:03.790
solve the whole recognition
problem.

00:32:07.190 --> 00:32:09.590
Years pass.

00:32:09.590 --> 00:32:13.600
Shimon Ullman comes up with
another theory that's not so

00:32:13.600 --> 00:32:20.310
much based on edge fragments or
the location of particular

00:32:20.310 --> 00:32:27.570
features but rather
on correlation.

00:32:27.570 --> 00:32:32.930
Taking a picture of, say,
Krishna's face, taking a

00:32:32.930 --> 00:32:37.120
picture of the whole class, and
then using that as a kind

00:32:37.120 --> 00:32:40.680
of correlation mask, running it
all over the picture of the

00:32:40.680 --> 00:32:43.050
class, seeing where
it maximizes out.

00:32:43.050 --> 00:32:43.600
Now that's vague.

00:32:43.600 --> 00:32:45.480
I'll explain when I'm talking
about [INAUDIBLE]

00:32:45.480 --> 00:32:47.610
correlation in a minute.

00:32:47.610 --> 00:32:53.400
But it's basically saying, if
I have a picture of Krishna,

00:32:53.400 --> 00:32:54.390
where do I find him?

00:32:54.390 --> 00:32:55.902
I'll find him in one place.

00:32:55.902 --> 00:32:57.750
But you know what?

00:32:57.750 --> 00:33:00.410
Krishna doesn't look
like anybody else.

00:33:00.410 --> 00:33:02.810
So I might not find
any other faces.

00:33:02.810 --> 00:33:06.840
And if my objective is to find
all the faces, then maybe that

00:33:06.840 --> 00:33:09.150
idea won't work either.

00:33:09.150 --> 00:33:13.590
Or, to take another example,
here's a dollar bill.

00:33:13.590 --> 00:33:18.130
We haven't had raises in quite
a while, so this is my last one.

00:33:18.130 --> 00:33:20.950
It's got a picture of George
Washington on it.

00:33:20.950 --> 00:33:22.740
And I can look all
over the class.

00:33:22.740 --> 00:33:26.630
And if I use this as a face
detector, I'd be sorely

00:33:26.630 --> 00:33:27.100
disappointed.

00:33:27.100 --> 00:33:29.430
Because I wouldn't
find any faces.

00:33:29.430 --> 00:33:32.350
Because thank God, nobody looks
exactly like George

00:33:32.350 --> 00:33:32.890
Washington.

00:33:32.890 --> 00:33:36.700
So the correlation wouldn't
work very well.

00:33:36.700 --> 00:33:37.950
So that idea's a loser.

00:33:41.580 --> 00:33:42.290
But wait a minute.

00:33:42.290 --> 00:33:45.250
I don't have to look
for the whole face.

00:33:45.250 --> 00:33:50.790
I could just look for eyes.

00:33:50.790 --> 00:33:53.770
And then I could look for
noses and maybe mouths.

00:33:53.770 --> 00:33:57.080
And maybe I could have a library
of 10 different eyes

00:33:57.080 --> 00:34:01.280
and 10 different noses and
10 different mouths.

00:34:01.280 --> 00:34:02.540
Would that idea work?

00:34:06.100 --> 00:34:07.440
Probably not so well.

00:34:07.440 --> 00:34:09.420
The trouble with that
one is, I'd find

00:34:09.420 --> 00:34:11.676
eyeballs in every doorknob.

00:34:11.676 --> 00:34:17.960
There's just not enough stuff
there to give me a reliable

00:34:17.960 --> 00:34:19.210
correlation.

00:34:20.920 --> 00:34:23.989
So let's make this a little
more concrete by

00:34:23.989 --> 00:34:25.239
drawing some pictures.

00:34:29.770 --> 00:34:32.880
Halloween is approaching.

00:34:32.880 --> 00:34:35.174
So here's a face.

00:34:42.387 --> 00:34:44.375
All right?

00:34:44.375 --> 00:34:45.866
Here's another face.

00:34:55.840 --> 00:34:59.160
So those might be faces in my
pre-recorded library of

00:34:59.160 --> 00:35:01.410
pumpkin faces.

00:35:01.410 --> 00:35:02.660
Now along comes this face.

00:35:13.690 --> 00:35:16.270
What's going to happen?

00:35:16.270 --> 00:35:18.490
Well, I don't know.

00:35:18.490 --> 00:35:20.200
Let's draw yet another face.

00:35:32.020 --> 00:35:33.440
I don't know, that could
be a pretty weird

00:35:33.440 --> 00:35:34.460
pumpkin face, I suppose.

00:35:34.460 --> 00:35:37.000
But I mean it to be something
that doesn't look very much

00:35:37.000 --> 00:35:39.460
like a face.

00:35:39.460 --> 00:35:44.380
So if I'm doing a complete
correlation with either of

00:35:44.380 --> 00:35:47.280
these faces in my library,
neither one will match this

00:35:47.280 --> 00:35:48.530
one very well.

00:35:51.150 --> 00:35:55.800
If I'm looking for fine features
like eyes, then I've

00:35:55.800 --> 00:36:01.300
got these eyes everywhere.

00:36:01.300 --> 00:36:04.190
So it doesn't help very much.

00:36:04.190 --> 00:36:05.380
So you can see where
I'm going.

00:36:05.380 --> 00:36:10.030
And you can reinvent Ullman's
great idea.

00:36:10.030 --> 00:36:11.960
What is it?

00:36:11.960 --> 00:36:15.200
We don't look for big features,
like whole faces.

00:36:15.200 --> 00:36:16.970
We don't look for
small features,

00:36:16.970 --> 00:36:18.846
like individual eyes.

00:36:18.846 --> 00:36:22.180
We look for intermediate
features, like two eyes and a

00:36:22.180 --> 00:36:25.040
nose, or a mouth and a nose.

00:36:25.040 --> 00:36:34.310
So when we do that, then we
can say, now, here are two

00:36:34.310 --> 00:36:37.120
eyes and a nose.

00:36:37.120 --> 00:36:38.520
Well, that's found
in this one.

00:36:42.370 --> 00:36:48.460
And what about the combination
of that nose and that mouth?

00:36:48.460 --> 00:36:51.051
Oh, that's over here.

00:36:51.051 --> 00:36:53.030
But neither of those features
can be found

00:36:53.030 --> 00:36:56.800
in the fourth image.

00:36:56.800 --> 00:36:59.410
So that's the Goldilocks
principle.

00:36:59.410 --> 00:37:00.945
When you're doing this sort of
thing, you want things that

00:37:00.945 --> 00:37:03.500
are not too small
and not too big.
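
NOTE
A toy sketch of the Goldilocks principle, assuming faces are reduced to symbolic
(eyes, nose, mouth) labels rather than images. The labels, the two-face library,
and the function names are all made up; the sketch only shows why whole-face
matching rejects too much, single-part matching accepts too much, and
intermediate features sit in between.
```python
# Two stored pumpkin faces, described symbolically as (eyes, nose, mouth).
LIBRARY = [("round", "triangle", "smile"), ("square", "triangle", "zigzag")]
# Too big: the whole face must equal a stored face.
def matches_whole(face):
    return face in LIBRARY
# Too small: any single part in the same position is enough.
def matches_small(face):
    return any(face[i] == stored[i] for stored in LIBRARY for i in range(3))
# Intermediate: an adjacent pair (eyes+nose or nose+mouth) must match together.
def matches_intermediate(face):
    return any(face[i:i + 2] == stored[i:i + 2] for stored in LIBRARY for i in range(2))
new_face = ("round", "triangle", "zigzag")  # eyes+nose from one face, nose+mouth from the other
non_face = ("round", "slit", "smile")       # familiar parts, unfamiliar combinations
for name, face in [("new face", new_face), ("non-face", non_face)]:
    print(name, matches_whole(face), matches_small(face), matches_intermediate(face))
```
Whole-face matching rejects both inputs and single-part matching accepts both;
only the intermediate test accepts the new face while rejecting the non-face.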

00:37:03.500 --> 00:37:07.740
I've got the Rumpelstiltskin
principle up

00:37:07.740 --> 00:37:08.970
there, too, by the way.

00:37:08.970 --> 00:37:12.020
Because I meant to mention
that Marr was a genius at

00:37:12.020 --> 00:37:13.830
naming things.

00:37:13.830 --> 00:37:18.650
And even though many of his
theories have faded, he's

00:37:18.650 --> 00:37:21.520
still known for these names like
primal sketch and two and

00:37:21.520 --> 00:37:23.030
a half D sketch because
he was such an artist

00:37:23.030 --> 00:37:24.610
at naming the concepts.

00:37:24.610 --> 00:37:27.900
He even got credit for a lot
of stuff that he didn't do.

00:37:27.900 --> 00:37:33.440
Not because he was deliberately
trying to get it

00:37:33.440 --> 00:37:35.240
inappropriately, but just
because he was so good at

00:37:35.240 --> 00:37:36.490
naming stuff.

00:37:36.490 --> 00:37:38.450
So we had the Rumpelstiltskin
principle back then.

00:37:38.450 --> 00:37:40.050
And now we have the Goldilocks
principle.

00:37:40.050 --> 00:37:43.535
Not too big, not too small.

00:37:43.535 --> 00:37:48.150
But that leaves us with the
final question, which is, so

00:37:48.150 --> 00:37:51.230
if what we want to do is look
for intermediate-size

00:37:51.230 --> 00:37:54.410
features, how do we actually
find them in a sea

00:37:54.410 --> 00:37:55.770
of faces out there?

00:37:55.770 --> 00:37:58.400
See, I might have a library,
I might take 10 of you and

00:37:58.400 --> 00:38:01.050
record your eyes.

00:38:01.050 --> 00:38:03.530
Take another ten, record
your mouths.

00:38:03.530 --> 00:38:06.400
And they may be put together
in a unique way for each of

00:38:06.400 --> 00:38:06.870
you out there.

00:38:06.870 --> 00:38:10.390
But it's likely that I'll
find Lana's eyes

00:38:10.390 --> 00:38:12.430
somewhere else in a crowd.

00:38:12.430 --> 00:38:16.850
And Nicola's mouth somewhere
else in a crowd.

00:38:16.850 --> 00:38:21.330
So how do we in fact go
about finding them?

00:38:21.330 --> 00:38:23.080
And I mentioned the
term correlation a

00:38:23.080 --> 00:38:24.720
couple of times now.

00:38:24.720 --> 00:38:26.500
Let me make that concrete.

00:38:31.270 --> 00:38:38.790
So let's consider a
one-dimensional face that

00:38:38.790 --> 00:38:40.040
looks like this.

00:38:47.950 --> 00:38:50.810
Which is a signal.

00:38:50.810 --> 00:38:53.640
And I'm going to consider
a one-dimensional image.

00:38:56.160 --> 00:39:04.390
And in that one-dimensional
image I've got a

00:39:04.390 --> 00:39:06.670
facsimile of the face.

00:39:06.670 --> 00:39:08.850
And the question is, what kind
of algorithm could I use to

00:39:08.850 --> 00:39:14.030
determine the offset in the
image where the face occurs?

00:39:14.030 --> 00:39:17.320
So you can see that one
possibility is you just do an

00:39:17.320 --> 00:39:25.270
integral of the signal in the
face and the signal out here

00:39:25.270 --> 00:39:29.610
over the extent of the face and
see how it multiplies out.

00:39:29.610 --> 00:39:34.920
Or, to make it less lawyerly
and more MITish, let's say

00:39:34.920 --> 00:39:41.310
that what we're going to do is
we're going to maximize over

00:39:41.310 --> 00:39:52.190
some parameter t the integral
over x of some face, which is

00:39:52.190 --> 00:40:04.220
a function of x and the image
g, which is a function of x

00:40:04.220 --> 00:40:07.830
minus that offset.

00:40:07.830 --> 00:40:14.200
So when the offset, t, is equal
to this offset, then

00:40:14.200 --> 00:40:17.350
we're essentially multiplying
the thing by itself and

00:40:17.350 --> 00:40:19.890
integrating over the
extent of the face.

00:40:19.890 --> 00:40:24.610
And that gives you a very big
number if they're lined up and

00:40:24.610 --> 00:40:27.420
a very small number
if they're not.

00:40:27.420 --> 00:40:32.370
And it's even true if we
add a whole lot of

00:40:32.370 --> 00:40:37.130
noise to the images.
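
NOTE
A minimal sketch of the one-dimensional case, assuming sampled signals so that
the integral becomes a sum. The score for offset t is written here as the sum
over x of f(x) * g(x + t); whether the offset enters with a plus or a minus sign
depends only on how the offset is defined. The signal values, the noise level,
and the function names are made up, and no normalization is applied.
```python
import random
# Score for placing the face at offset t: sum over x of face[x] * image[x + t].
def correlation(face, image, t):
    return sum(f * image[x + t] for x, f in enumerate(face))
# Offset at which the score is largest.
def best_offset(face, image):
    offsets = range(len(image) - len(face) + 1)
    return max(offsets, key=lambda t: correlation(face, image, t))
rng = random.Random(0)
face = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 2.0, -2.0]  # made-up 1-D "face" signal
true_offset = 20
image = [rng.uniform(-0.3, 0.3) for _ in range(50)]   # bounded background noise
for x, f in enumerate(face):
    image[x + true_offset] += f                       # embed the face in the noise
# When lined up, the face is multiplied by itself, so the score peaks there.
assert best_offset(face, image) == true_offset
```
With the face lined up, the signal part of the score is the sum of squares of
the face (20 here); at every other offset it is at most 12, so noise of this
size cannot move the peak.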

00:40:37.130 --> 00:40:38.660
But these are images.

00:40:38.660 --> 00:40:39.595
They're not one dimensional.

00:40:39.595 --> 00:40:41.210
But that's OK.

00:40:41.210 --> 00:40:44.215
It's easy enough to make
a modification here.

00:40:44.215 --> 00:40:46.980
We're going to maximize
over translation

00:40:46.980 --> 00:40:49.140
parameters x and y.

00:40:49.140 --> 00:40:51.970
And these are no longer
functions of just x, they're

00:40:51.970 --> 00:40:53.220
also functions of y.

00:40:56.900 --> 00:40:59.750
Like so.

00:40:59.750 --> 00:41:01.380
So that's basically
how it works.

00:41:01.380 --> 00:41:03.690
We won't go into details about
normalization and all that

00:41:03.690 --> 00:41:06.480
sort of thing because that's
the stuff of which other

00:41:06.480 --> 00:41:08.825
courses remain the custodians.
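
NOTE
A minimal sketch of the two-dimensional version, assuming images stored as lists
of rows and a plain, unnormalized sum of products. The mask values, image size,
noise level, and function names are made up for illustration.
```python
import random
# Score for placing the mask with its top-left corner at (row, col).
def correlation_2d(mask, image, row, col):
    return sum(mask[i][j] * image[row + i][col + j]
               for i in range(len(mask)) for j in range(len(mask[0])))
# Try every placement and return the (row, col) with the largest score.
def best_placement(mask, image):
    rows = len(image) - len(mask) + 1
    cols = len(image[0]) - len(mask[0]) + 1
    placements = [(r, c) for r in range(rows) for c in range(cols)]
    return max(placements, key=lambda rc: correlation_2d(mask, image, rc[0], rc[1]))
rng = random.Random(1)
mask = [[2.0, -1.0, 2.0],
        [-1.0, 3.0, -1.0],
        [2.0, -1.0, 2.0]]                                  # made-up 3x3 "face"
image = [[rng.uniform(-0.3, 0.3) for _ in range(12)] for _ in range(10)]
true_row, true_col = 4, 7
for i in range(3):
    for j in range(3):
        image[true_row + i][true_col + j] += mask[i][j]    # embed the face in the noise
assert best_placement(mask, image) == (true_row, true_col)
```
Without normalization, a uniformly bright region can also score highly against
a mostly positive mask, which is one reason practical systems normalize.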

00:41:11.340 --> 00:41:13.412
So would you like to see
a demonstration?

00:41:13.412 --> 00:41:14.662
OK.

00:41:36.410 --> 00:41:37.220
All right.

00:41:37.220 --> 00:41:42.000
So without realizing it, Nicola
and Erica have loaned

00:41:42.000 --> 00:41:44.080
us their pictures.

00:41:44.080 --> 00:41:49.490
And they are embedded in that
big field of noise.

00:41:49.490 --> 00:41:52.170
And it's pretty easy to pick out
Erica and Nicola, right?

00:41:52.170 --> 00:41:57.120
Because we are actually pretty
good at picking faces out of

00:41:57.120 --> 00:41:58.740
these images.

00:41:58.740 --> 00:42:01.220
So let's add some noise.

00:42:05.640 --> 00:42:08.160
It's a little harder now.

00:42:08.160 --> 00:42:10.100
What I'm going to do is I'm going
to run this correlation

00:42:10.100 --> 00:42:18.000
program over this whole image
using Nicola's face as a mask

00:42:18.000 --> 00:42:20.670
and seeing where the correlation
peaks up, in spite

00:42:20.670 --> 00:42:21.920
of all the noise that's
in there.

00:42:28.290 --> 00:42:29.540
Boom, there he is.

00:42:32.780 --> 00:42:34.820
I don't know, maybe we
can find Erica too.

00:42:37.370 --> 00:42:40.110
I forgot where she was.

00:42:40.110 --> 00:42:41.360
I can't find her.

00:42:44.740 --> 00:42:47.670
There she is.

00:42:47.670 --> 00:42:50.490
Unfortunately the parameters
aren't very good here.

00:42:50.490 --> 00:42:52.890
Do you see that?

00:42:52.890 --> 00:42:55.550
Let me get another
version of this.

00:42:55.550 --> 00:42:59.520
I'll just do some real-time
programming.

00:43:08.780 --> 00:43:13.210
I've been trying to reset the
parameters so that the images

00:43:13.210 --> 00:43:17.070
in the demonstration come
out clearly up there.

00:43:17.070 --> 00:43:19.680
Let's see if this works
a little better.

00:43:19.680 --> 00:43:20.990
OK, so let's add some noise.

00:43:23.860 --> 00:43:25.630
And let's find Erica.

00:43:28.750 --> 00:43:30.000
There she is.

00:43:32.340 --> 00:43:33.290
There are some other
things that look a

00:43:33.290 --> 00:43:34.550
little bit like Erica.

00:43:34.550 --> 00:43:36.800
But nothing looks quite
exactly like Erica.

00:43:39.450 --> 00:43:42.245
So let's try Nicola's eyes.

00:43:46.090 --> 00:43:48.070
So they stand out pretty
clearly against the

00:43:48.070 --> 00:43:49.630
background.

00:43:49.630 --> 00:43:51.280
Let's see if we can
find Erica's eyes.

00:43:54.580 --> 00:43:56.150
So they stand out pretty
clearly against the

00:43:56.150 --> 00:43:56.440
background.

00:43:56.440 --> 00:43:59.300
Notice that it also gets
Nicola's eyes.

00:43:59.300 --> 00:44:04.840
So two eyes is an
intermediate-size constraint.

00:44:04.840 --> 00:44:08.780
It's loose enough that it will
match more than one person.

00:44:08.780 --> 00:44:12.130
But it's not so loose
that it's as bad as

00:44:12.130 --> 00:44:15.050
looking for one eye.

00:44:15.050 --> 00:44:17.490
See, they're all
over the place.

00:44:17.490 --> 00:44:21.640
So two eyes and a nose, a mouth
and a nose, that would

00:44:21.640 --> 00:44:23.870
be even better as an
intermediate feature.

00:44:23.870 --> 00:44:25.875
But it doesn't matter what the
best ones are, because you can

00:44:25.875 --> 00:44:28.620
work that out experimentally.

00:44:28.620 --> 00:44:30.660
So that's how correlation
works.

00:44:30.660 --> 00:44:34.690
And it's just amazing how much
noise you can add and it'll

00:44:34.690 --> 00:44:36.500
still pick out the
right stuff.

00:44:39.160 --> 00:44:39.780
There's Nicola.

00:44:39.780 --> 00:44:41.080
Boom.

00:44:41.080 --> 00:44:43.290
Very clear.

00:44:43.290 --> 00:44:46.790
Want to add some more noise?

00:44:46.790 --> 00:44:49.640
I don't know, I can see it,
but that's because I'm a

00:44:49.640 --> 00:44:50.910
pretty good correlator, too.

00:44:55.300 --> 00:44:56.500
Boom.

00:44:56.500 --> 00:44:57.930
I don't know, let's add
some more noise.

00:45:04.420 --> 00:45:06.480
It's just hard to
get rid of it.

00:45:06.480 --> 00:45:09.650
It's just amazing how well
it picks it out.

00:45:09.650 --> 00:45:10.400
That's good.

00:45:10.400 --> 00:45:12.640
That's cool.

00:45:12.640 --> 00:45:16.690
Now, but the reason that this
is 30 years and we're still

00:45:16.690 --> 00:45:19.340
not done is there are still
some questions.

00:45:19.340 --> 00:45:22.730
This is recognizing
stuff straight on.

00:45:22.730 --> 00:45:25.260
How is it I can recognize you
in the hall from the side?

00:45:25.260 --> 00:45:27.600
Nobody knows.

00:45:27.600 --> 00:45:31.830
One possibility is that you have
an ability to make those

00:45:31.830 --> 00:45:32.630
transformations.

00:45:32.630 --> 00:45:37.410
If so, then that alignment
theory has a role to play.

00:45:37.410 --> 00:45:41.360
Another theory is that, well,
after I've seen you once I can

00:45:41.360 --> 00:45:44.680
watch you turn your head and
keep recording what you look

00:45:44.680 --> 00:45:47.220
like at all possible angles.

00:45:47.220 --> 00:45:48.480
That would work.

00:45:48.480 --> 00:45:52.150
The trouble is, is there
enough stuff in there?

00:45:52.150 --> 00:45:52.650
Maybe.

00:45:52.650 --> 00:45:53.900
We don't know.

00:45:55.770 --> 00:45:59.040
Now what would it take to
break this mechanism?

00:45:59.040 --> 00:45:59.910
Well, I don't know.

00:45:59.910 --> 00:46:01.200
Let's just see if we can
break the mechanism.

00:46:08.600 --> 00:46:11.815
Let's see if you can recognize
some well-known faces.

00:46:15.820 --> 00:46:16.400
Who's that?

00:46:16.400 --> 00:46:17.430
Quick.

00:46:17.430 --> 00:46:18.540
STUDENT: Obama.

00:46:18.540 --> 00:46:21.290
PATRICK WINSTON: Oh,
that's too easy.

00:46:21.290 --> 00:46:23.110
We'll see if we can make
some harder ones.

00:46:25.930 --> 00:46:26.955
Yeah, that's Obama.

00:46:26.955 --> 00:46:28.280
Who's this?

00:46:28.280 --> 00:46:29.600
STUDENT: Bush.

00:46:29.600 --> 00:46:29.940
PATRICK WINSTON: Oh boy.

00:46:29.940 --> 00:46:31.440
You're really good at this.

00:46:31.440 --> 00:46:32.200
That's Bush.

00:46:32.200 --> 00:46:32.960
How about this guy?

00:46:32.960 --> 00:46:34.210
STUDENT: Kerry.

00:46:38.680 --> 00:46:39.000
PATRICK WINSTON: OK.

00:46:39.000 --> 00:46:39.780
Now I've got it.

00:46:39.780 --> 00:46:41.220
Some people are starting
to turn their heads.

00:46:41.220 --> 00:46:42.722
And that's not fair.

00:46:42.722 --> 00:46:44.200
[LAUGHTER]

00:46:44.200 --> 00:46:46.030
PATRICK WINSTON: That's
not fair.

00:46:46.030 --> 00:46:49.070
Because you see what's happened
is that if this kind

00:46:49.070 --> 00:46:52.500
of pumpkin theory is correct,
then when you turn

00:46:52.500 --> 00:46:56.020
the face upside down you lose
the correlation of those

00:46:56.020 --> 00:46:58.740
features that have vertical
components.

00:46:58.740 --> 00:47:01.570
So if you have two eyes and a
nose, they won't match two

00:47:01.570 --> 00:47:04.890
eyes and a nose when they're
turned upside down.
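
NOTE
A toy sketch of why inversion hurts correlation, assuming a tiny binary patch
that is left-right symmetric, so that turning it upside down amounts to
reversing its rows. The patch values and function names are made up.
```python
# Unnormalized correlation of two equally sized 2-D patches at zero offset.
def score(a, b):
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
# Turn a patch upside down by reversing the order of its rows.
def upside_down(patch):
    return patch[::-1]
eyes_and_nose = [[1, 0, 1],
                 [0, 0, 0],
                 [0, 1, 0]]  # two eyes above a nose
eyes_only = [[1, 0, 1]]      # a single row has no vertical arrangement to lose
print(score(eyes_and_nose, eyes_and_nose))               # 3: upright against upright
print(score(eyes_and_nose, upside_down(eyes_and_nose)))  # 0: upright against inverted
print(score(eyes_only, upside_down(eyes_only)))          # 2: unchanged by inversion
```
The feature with a vertical arrangement loses its match entirely when inverted,
while the purely horizontal one is unaffected.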

00:47:04.890 --> 00:47:05.590
Well, let's see.

00:47:05.590 --> 00:47:08.470
We'll try some more.

00:47:08.470 --> 00:47:10.514
Who's that?

00:47:10.514 --> 00:47:11.370
STUDENT: Gorbachev.

00:47:11.370 --> 00:47:11.800
PATRICK WINSTON: Gorbachev.

00:47:11.800 --> 00:47:13.120
Who said that?

00:47:13.120 --> 00:47:14.575
Leonid, where are you?

00:47:14.575 --> 00:47:15.430
This is Gorbachev, right?

00:47:15.430 --> 00:47:18.340
You can recognize him because of
the little birthmark on the

00:47:18.340 --> 00:47:20.150
top of his head.

00:47:20.150 --> 00:47:21.105
One more.

00:47:21.105 --> 00:47:22.275
Who's--

00:47:22.275 --> 00:47:23.140
oh, that's easy.

00:47:23.140 --> 00:47:26.050
Who is it?

00:47:26.050 --> 00:47:28.050
That's Clinton.

00:47:28.050 --> 00:47:29.300
How about this one?

00:47:34.520 --> 00:47:39.000
Do you see how insulting
it is to be at MIT?

00:47:39.000 --> 00:47:40.076
That's me.

00:47:40.076 --> 00:47:43.480
[LAUGHTER]

00:47:43.480 --> 00:47:46.720
PATRICK WINSTON: And you
didn't even know.

00:47:46.720 --> 00:47:47.970
Oh, god.

00:47:52.770 --> 00:47:57.700
So this might be evidence for
the correlation theory.

00:47:57.700 --> 00:48:01.280
But of course, turning the face
upside down would make it

00:48:01.280 --> 00:48:02.860
very difficult to do
alignment, too.

00:48:02.860 --> 00:48:06.060
So it would break our alignment
theory, as well.

00:48:06.060 --> 00:48:09.045
Let me get that after class,
Was there a mistake, or?

00:48:09.045 --> 00:48:09.500
STUDENT: No, no.

00:48:09.500 --> 00:48:13.430
I was just curious [INAUDIBLE]
stretching would break the

00:48:13.430 --> 00:48:14.140
correlation.

00:48:14.140 --> 00:48:15.461
PATRICK WINSTON: If what would
break the structure?

00:48:15.461 --> 00:48:16.383
What?

00:48:16.383 --> 00:48:16.844
Stretching?

00:48:16.844 --> 00:48:18.094
STUDENT: [INAUDIBLE].

00:48:20.120 --> 00:48:21.800
PATRICK WINSTON: Elliot asked if
stretching would break the

00:48:21.800 --> 00:48:22.970
correlation.

00:48:22.970 --> 00:48:30.800
And the answer is, I think,
stretching in the vertical

00:48:30.800 --> 00:48:33.290
dimension is worse than
stretching in

00:48:33.290 --> 00:48:34.790
the horizontal dimension.

00:48:34.790 --> 00:48:36.455
Because you get a certain amount
of stretching in the

00:48:36.455 --> 00:48:38.700
horizontal dimension when
you just turn your head.

00:48:38.700 --> 00:48:41.450
By the way, since our faces
are basically mounted on a

00:48:41.450 --> 00:48:45.550
cylinder, this kind
of transformation

00:48:45.550 --> 00:48:46.890
might actually work.

00:48:46.890 --> 00:48:51.140
That's a sidebar to the answer
to your question, Elliot.

00:48:51.140 --> 00:48:53.730
But now you say, well, OK, so
this is not completely solved.

00:48:53.730 --> 00:48:55.980
You can work this out.

00:48:55.980 --> 00:48:59.430
But if you really want to work
something out, let me tell you

00:48:59.430 --> 00:49:03.570
what the current questions
are in computer vision.

00:49:03.570 --> 00:49:05.090
People have worked for an
awful long time on this

00:49:05.090 --> 00:49:16.010
recognition stuff and, to my
mind, have neglected the more

00:49:16.010 --> 00:49:18.900
serious questions.

00:49:18.900 --> 00:49:21.030
The more serious questions
are, how do you visually

00:49:21.030 --> 00:49:24.150
determine what's happening?

00:49:24.150 --> 00:49:28.280
If you could write a program
that would reliably determine

00:49:28.280 --> 00:49:31.520
when these verbs are happening
in your field of view, I will

00:49:31.520 --> 00:49:32.970
sign your Ph.D. thesis
tomorrow.

00:49:32.970 --> 00:49:35.680
There are 48 of them there.

00:49:35.680 --> 00:49:37.610
And that is today's challenge.

00:49:37.610 --> 00:49:40.630
But since we're short on time,
I want to skip over that and

00:49:40.630 --> 00:49:42.800
perform an experiment on you.

00:49:42.800 --> 00:49:44.892
I want you to tell me
what I'm doing.

00:49:44.892 --> 00:49:46.600
STUDENT: [INAUDIBLE].

00:49:46.600 --> 00:49:49.960
PATRICK WINSTON: So the best
single-word answer is?

00:49:49.960 --> 00:49:51.090
[INAUDIBLE]?

00:49:51.090 --> 00:49:51.490
STUDENT: Drinking.

00:49:51.490 --> 00:49:54.020
PATRICK WINSTON: OK, this
is not a trick question.

00:49:54.020 --> 00:49:56.500
OK, the best single-word
answer.

00:49:56.500 --> 00:49:57.778
Christopher, what
do you think?

00:49:57.778 --> 00:49:59.272
STUDENT: Toasting.

00:49:59.272 --> 00:50:00.770
PATRICK WINSTON: Christopher.

00:50:00.770 --> 00:50:02.942
Well, you.

00:50:02.942 --> 00:50:04.874
You.

00:50:04.874 --> 00:50:06.330
STUDENT: Toasting.

00:50:06.330 --> 00:50:07.910
PATRICK WINSTON: What?

00:50:07.910 --> 00:50:09.298
Toasting.

00:50:09.298 --> 00:50:09.782
OK.

00:50:09.782 --> 00:50:12.690
Not a trick question.

00:50:12.690 --> 00:50:13.940
What's happening here?

00:50:18.066 --> 00:50:20.878
Best single-word answer?

00:50:20.878 --> 00:50:21.786
STUDENT: Drinking.

00:50:21.786 --> 00:50:24.060
PATRICK WINSTON: Is drinking.

00:50:24.060 --> 00:50:25.846
Which pair look more alike?

00:50:25.846 --> 00:50:32.170
[LAUGHTER]

00:50:32.170 --> 00:50:34.210
PATRICK WINSTON: So that cat is
drinking and nobody has any

00:50:34.210 --> 00:50:35.500
trouble recognizing that.

00:50:35.500 --> 00:50:43.280
And I believe it's because
you're telling a story.

00:50:43.280 --> 00:50:46.260
So our power of storytelling
even reaches down into our

00:50:46.260 --> 00:50:47.620
visual apparatus.

00:50:47.620 --> 00:50:52.720
So the story here is that some
animal has evidently had an

00:50:52.720 --> 00:50:56.860
urge to find something to drink
and water is passing

00:50:56.860 --> 00:50:57.920
through that animal's mouth.

00:50:57.920 --> 00:50:59.720
That's the drinking story.

00:50:59.720 --> 00:51:02.520
So even though they look
enormously different visually,

00:51:02.520 --> 00:51:05.360
the stuff at the bottom of our
vision system provides enough

00:51:05.360 --> 00:51:08.910
evidence for our story apparatus
so that we can give

00:51:08.910 --> 00:51:12.300
the left one and the right one
different labels and recognize

00:51:12.300 --> 00:51:13.550
the cat is drinking.

00:51:16.410 --> 00:51:17.950
And that's the end
of the story.