WEBVTT

00:00:01.640 --> 00:00:04.040
The following content is
provided under a Creative

00:00:04.040 --> 00:00:05.580
Commons license.

00:00:05.580 --> 00:00:07.880
Your support will help
MIT OpenCourseWare

00:00:07.880 --> 00:00:12.270
continue to offer high quality
educational resources for free.

00:00:12.270 --> 00:00:14.870
To make a donation or
view additional materials

00:00:14.870 --> 00:00:18.830
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:18.830 --> 00:00:21.839
at ocw.mit.edu.

00:00:21.839 --> 00:00:22.880
JOHN LEONARD: OK, thanks.

00:00:22.880 --> 00:00:24.338
Thanks for the
opportunity to talk.

00:00:24.338 --> 00:00:25.940
So hi, everyone.

00:00:25.940 --> 00:00:27.770
It's a great pleasure
to talk here at MBL.

00:00:27.770 --> 00:00:29.769
I've been coming to the
Woods Hole Oceanographic

00:00:29.769 --> 00:00:33.440
Institution for many years, but this is
my first time over here at MBL.

00:00:33.440 --> 00:00:37.370
And so I'm going to try to cover
three different topics, which

00:00:37.370 --> 00:00:39.110
is probably a little
ambitious on time.

00:00:39.110 --> 00:00:41.420
But there's so much
I'd love to say to you.

00:00:41.420 --> 00:00:43.370
I want to talk about
self-driving cars.

00:00:43.370 --> 00:00:46.850
And use it as a context
to think about questions

00:00:46.850 --> 00:00:49.910
of representation for
localization and mapping,

00:00:49.910 --> 00:00:52.550
and maybe connect it into
some of the brain questions

00:00:52.550 --> 00:00:54.046
that you folks
are interested in,

00:00:54.046 --> 00:00:55.670
and time permitting,
at the end mention

00:00:55.670 --> 00:00:58.640
a little bit of work we've
done on object-based mapping

00:00:58.640 --> 00:00:59.600
in my lab.

00:00:59.600 --> 00:01:03.180
So my background-- I
grew up in Philadelphia.

00:01:03.180 --> 00:01:05.540
Went to UPenn for engineering.

00:01:05.540 --> 00:01:08.660
But then went to Oxford to do
my PhD at a very exciting time

00:01:08.660 --> 00:01:10.910
when their computer vision
and robotics group was just

00:01:10.910 --> 00:01:13.700
being formed at Oxford
under Michael Brady.

00:01:13.700 --> 00:01:16.190
And then I came back to
MIT and started working

00:01:16.190 --> 00:01:17.270
with underwater vehicles.

00:01:17.270 --> 00:01:19.850
And that's when I got involved
with Woods Hole Oceanographic

00:01:19.850 --> 00:01:20.990
Institution.

00:01:20.990 --> 00:01:23.390
And I was very fortunate
to join the AI lab back

00:01:23.390 --> 00:01:27.710
around 2002, which
became part of CSAIL.

00:01:27.710 --> 00:01:29.390
And really, I've
been able to work

00:01:29.390 --> 00:01:32.450
with really amazing
colleagues and amazing robots

00:01:32.450 --> 00:01:35.100
in a challenging
set of environments.

00:01:35.100 --> 00:01:37.250
So autonomous
underwater vehicles

00:01:37.250 --> 00:01:42.020
provide a very unique challenge
because we have very poor

00:01:42.020 --> 00:01:43.170
communications to them.

00:01:43.170 --> 00:01:44.720
Typically, we use
acoustic modems

00:01:44.720 --> 00:01:49.040
that might give you 96 bytes if
you're lucky every 10 seconds

00:01:49.040 --> 00:01:50.810
to a few kilometers range.
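
NOTE
A rough back-of-the-envelope sketch of the acoustic link described above, assuming one 96-byte packet every 10 seconds as quoted. The function name link_rate_bps and the 1 MB example payload are illustrative assumptions, not figures from the talk.
```python
def link_rate_bps(payload_bytes: float, period_s: float) -> float:
    """Average throughput in bits per second for one packet per period."""
    return payload_bytes * 8 / period_s
rate = link_rate_bps(96, 10)  # packet size and period quoted in the talk
hours_for_1mb = (1_000_000 * 8 / rate) / 3600  # hypothetical 1 MB payload
print(f"{rate:.1f} bit/s, roughly {hours_for_1mb:.0f} hours per megabyte")
```
At that rate the vehicle cannot stream sensor data to an operator, which is one reason the perception and decisions have to run onboard.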

00:01:50.810 --> 00:01:55.280
And so we also need to think
about the sort of constraints

00:01:55.280 --> 00:01:57.860
of running in real
time onboard a vehicle.

00:01:57.860 --> 00:02:02.690
And so the sort of work
that my lab's done--

00:02:02.690 --> 00:02:04.970
while we investigate
more fundamental questions

00:02:04.970 --> 00:02:07.640
about robot perception,
navigation, and mapping,

00:02:07.640 --> 00:02:09.780
we also are involved
in building systems.

00:02:09.780 --> 00:02:13.310
So this is a project I did for
the Office of Naval Research

00:02:13.310 --> 00:02:16.310
some years ago using
small vehicles that

00:02:16.310 --> 00:02:18.410
would reacquire
mine-like targets

00:02:18.410 --> 00:02:19.910
on the bottom for the Navy.

00:02:19.910 --> 00:02:22.100
And so this is an example
of a more applied system

00:02:22.100 --> 00:02:24.710
where we had a very small
resource-constrained platform.

00:02:24.710 --> 00:02:27.050
And the sort of work we
did is a robot built a map

00:02:27.050 --> 00:02:29.300
as it performed its
mission, and then matched

00:02:29.300 --> 00:02:31.620
the map against
the prior map to do

00:02:31.620 --> 00:02:33.766
terminal guidance to a target.

00:02:33.766 --> 00:02:35.390
Another big system
I was involved with,

00:02:35.390 --> 00:02:37.667
as Russ mentioned, was
the Urban Challenge.

00:02:37.667 --> 00:02:39.500
And I'll say a bit about
that in the context

00:02:39.500 --> 00:02:41.510
of self-driving cars.

00:02:41.510 --> 00:02:43.670
So let's see.

00:02:43.670 --> 00:02:47.780
So who's heard any of the
recent statements from Elon Musk

00:02:47.780 --> 00:02:49.160
from Tesla?

00:02:49.160 --> 00:02:55.630
So he said self-driving
cars are solved.

00:02:55.630 --> 00:02:59.892
And a particular thing that
he said that just made my--

00:02:59.892 --> 00:03:01.850
I don't know, maybe steam
came out of my head--

00:03:01.850 --> 00:03:04.520
was that he compared
autonomous cars

00:03:04.520 --> 00:03:06.709
with elevators that used
to require operators

00:03:06.709 --> 00:03:07.750
but are now self-service.

00:03:07.750 --> 00:03:10.520
So imagine you getting in
a car, pressing a button,

00:03:10.520 --> 00:03:14.150
and arriving at MIT in
Cambridge 80 miles away,

00:03:14.150 --> 00:03:18.110
navigating through
the Boston downtown

00:03:18.110 --> 00:03:19.626
highways and intersections.

00:03:19.626 --> 00:03:20.750
And maybe that will happen.

00:03:20.750 --> 00:03:23.010
But I think it's going to
take a lot longer than folks

00:03:23.010 --> 00:03:23.510
are saying.

00:03:23.510 --> 00:03:25.760
And some of that comes from
just fundamental questions

00:03:25.760 --> 00:03:28.400
in intelligence and robotics.

00:03:28.400 --> 00:03:31.580
So in a nutshell, when Musk
says that self-driving is solved

00:03:31.580 --> 00:03:35.480
I think he's wrong, as
much as I admire what

00:03:35.480 --> 00:03:37.720
Tesla and SpaceX have done.

00:03:37.720 --> 00:03:40.010
And so to talk
about that, I think

00:03:40.010 --> 00:03:42.620
we need to be very honest as
a field about our failures

00:03:42.620 --> 00:03:44.060
as well as our
successes, and try

00:03:44.060 --> 00:03:47.690
to balance what you hear in the
media with the reality of where

00:03:47.690 --> 00:03:49.340
I think we are.

00:03:49.340 --> 00:03:51.890
And so I wanted
to quote verbatim

00:03:51.890 --> 00:03:55.470
what Russ said about
the robotics challenge,

00:03:55.470 --> 00:03:59.390
about a project that was
so exhausting and just

00:03:59.390 --> 00:04:03.100
all-consuming and so
stressful, yet so rewarding.

00:04:03.100 --> 00:04:06.200
So we did this in
2006 and 2007--

00:04:06.200 --> 00:04:08.210
my wonderful colleagues,
Seth Teller, Jonathan How,

00:04:08.210 --> 00:04:09.350
Emilio Frazzoli--

00:04:09.350 --> 00:04:11.170
amazing students and postdocs.

00:04:11.170 --> 00:04:13.040
We had a very large team.

00:04:13.040 --> 00:04:14.810
And we tried to push
the limit on what

00:04:14.810 --> 00:04:17.970
was possible with perception
and real-time motion planning.

00:04:17.970 --> 00:04:20.209
So our vehicle built a
local map as it traveled

00:04:20.209 --> 00:04:23.820
from its perceptual data,
using data from laser scanners

00:04:23.820 --> 00:04:25.070
and cameras.

00:04:25.070 --> 00:04:27.410
And we didn't want to
blindly follow GPS.

00:04:27.410 --> 00:04:29.630
We wanted the car to
make its own decisions

00:04:29.630 --> 00:04:31.929
because GPS navigation was
part of the original quest

00:04:31.929 --> 00:04:32.720
with the challenge.

00:04:32.720 --> 00:04:35.570
And so Seth Teller and
his student, Albert Huang,

00:04:35.570 --> 00:04:38.380
developed a vision-based
perceptual system

00:04:38.380 --> 00:04:42.260
where the car tried to detect
curbs and lane markings

00:04:42.260 --> 00:04:44.450
in very challenging
vision conditions.

00:04:44.450 --> 00:04:46.670
For example, looking
into the sun, which

00:04:46.670 --> 00:04:47.720
you'll see in a second--

00:04:47.720 --> 00:04:51.770
really challenging situation for
trying to perceive the world.

00:04:51.770 --> 00:04:53.120
And so our vehicle--

00:04:53.120 --> 00:04:56.480
at the time, we went a little
crazy on the computation.

00:04:56.480 --> 00:05:00.500
We had 10 blades,
each with four cores--

00:05:00.500 --> 00:05:04.190
40 cores-- which may not seem
a lot now, but we needed 3.5

00:05:04.190 --> 00:05:07.340
kilowatts just to power
the computer at full tilt.

00:05:07.340 --> 00:05:09.650
We fully loaded the computer
with a randomized motion

00:05:09.650 --> 00:05:11.870
planner, with all these
perception algorithms.

00:05:11.870 --> 00:05:14.240
We had a Velodyne laser
scanner on the roof.

00:05:14.240 --> 00:05:19.700
And about 12 other laser
scanners, 5 cameras, 15 radars,

00:05:19.700 --> 00:05:22.710
and we really pushed the
envelope on algorithms.

00:05:22.710 --> 00:05:25.430
And so when faced with a
choice in a DARPA challenge,

00:05:25.430 --> 00:05:27.010
if you want to win
at all costs you

00:05:27.010 --> 00:05:29.135
might simplify, or try to
read the rules carefully,

00:05:29.135 --> 00:05:30.782
or guess the rule
simplifications.

00:05:30.782 --> 00:05:32.240
But that would have
meant just sort

00:05:32.240 --> 00:05:34.280
of turning off the work
of our PhD students,

00:05:34.280 --> 00:05:35.730
and we didn't want to do that.

00:05:35.730 --> 00:05:37.897
So at the end of the day,
all credit to the teams

00:05:37.897 --> 00:05:38.480
that did well.

00:05:38.480 --> 00:05:40.610
Carnegie Mellon--
first, $2 million,

00:05:40.610 --> 00:05:43.610
Stanford-- second, $1 million,
Virginia Tech-- third,

00:05:43.610 --> 00:05:45.830
half a million
dollars, MIT-- fourth,

00:05:45.830 --> 00:05:47.770
and nothing for fourth place.

00:05:47.770 --> 00:05:51.470
But it was quite an
amazing experience.

00:05:51.470 --> 00:05:54.386
And in the spirit of
advertising our failures

00:05:54.386 --> 00:05:55.760
I think I have
time to show this.

00:05:55.760 --> 00:05:57.440
This used to be painful
for me to watch.

00:05:57.440 --> 00:05:59.239
But now I've gotten over it.

00:05:59.239 --> 00:05:59.780
This is our--

00:05:59.780 --> 00:06:00.446
[VIDEO PLAYBACK]

00:06:00.446 --> 00:06:04.010
- Let's check in once
again with the boss.

00:06:04.010 --> 00:06:06.170
JOHN LEONARD: Even though
we finished the race,

00:06:06.170 --> 00:06:08.180
we had a few incidents so
DARPA stopped things and let

00:06:08.180 --> 00:06:08.680
us continue.

00:06:08.680 --> 00:06:09.514
- --across the line.

00:06:09.514 --> 00:06:11.513
JOHN LEONARD: Carnegie Mellon,
who won the race.

00:06:11.513 --> 00:06:12.280
Why did that stop?

00:06:12.280 --> 00:06:13.626
Let's see.

00:06:13.626 --> 00:06:17.330
- --at the end of mission
two behind Virginia Tech.

00:06:17.330 --> 00:06:21.087
Virginia Tech got
a little issue.

00:06:21.087 --> 00:06:21.920
[INAUDIBLE] Here's--

00:06:21.920 --> 00:06:24.545
JOHN LEONARD: We were trying to
pass Cornell for a few minutes.

00:06:24.545 --> 00:06:26.380
- Looks like they're stopped.

00:06:26.380 --> 00:06:29.140
And it looks like they're--

00:06:29.140 --> 00:06:31.550
that the 79 is trying
to pass and has

00:06:31.550 --> 00:06:35.791
passed the chase vehicle
for Skynet, the 26 vehicle.

00:06:35.791 --> 00:06:36.290
Wow.

00:06:36.290 --> 00:06:37.450
And now he's done it.

00:06:37.450 --> 00:06:39.020
And Talos is going to pass.

00:06:39.020 --> 00:06:40.680
Very aggressive.

00:06:40.680 --> 00:06:42.256
And, whoa.

00:06:42.256 --> 00:06:43.120
Ohh.

00:06:43.120 --> 00:06:45.370
We had our first collision.

00:06:45.370 --> 00:06:47.380
Crash in turn one.

00:06:47.380 --> 00:06:48.730
Oh boy.

00:06:48.730 --> 00:06:51.030
That is, you know,
that's a bold maneuver.

00:06:54.376 --> 00:06:55.130
[END PLAYBACK]

00:06:55.130 --> 00:06:58.090
JOHN LEONARD: So what
actually happened?

00:06:58.090 --> 00:06:59.630
So it turned out
Cornell were having

00:06:59.630 --> 00:07:00.770
problems with their actuators.

00:07:00.770 --> 00:07:02.420
They were sort of stopping
and starting and stopping

00:07:02.420 --> 00:07:02.961
and starting.

00:07:02.961 --> 00:07:04.910
And we had some problems.

00:07:04.910 --> 00:07:06.480
It turned out we
had about five bugs.

00:07:06.480 --> 00:07:08.750
They had about five
bugs that interacted.

00:07:08.750 --> 00:07:10.452
And here's a computer's eye--

00:07:10.452 --> 00:07:11.910
sort of, brain of
the robot's view.

00:07:11.910 --> 00:07:14.750
Now back in '07, we weren't
using a lot of vision

00:07:14.750 --> 00:07:17.240
for object detection
and classification.

00:07:17.240 --> 00:07:18.680
So with the laser scanner--

00:07:18.680 --> 00:07:20.510
the Cornell vehicle's there.

00:07:20.510 --> 00:07:21.519
It has a license plate.

00:07:21.519 --> 00:07:22.310
It has tail lights.

00:07:22.310 --> 00:07:23.269
It has a big number 26.

00:07:23.269 --> 00:07:24.476
It's on the middle of a road.

00:07:24.476 --> 00:07:25.680
We should know that's a car.

00:07:25.680 --> 00:07:26.504
Stay away from it.

00:07:26.504 --> 00:07:27.920
But to the laser
scanner it's just

00:07:27.920 --> 00:07:29.680
a blob of laser scanner data.

00:07:29.680 --> 00:07:33.440
And even when we pull
around the side of the car

00:07:33.440 --> 00:07:36.050
we weren't clever enough
with our algorithms

00:07:36.050 --> 00:07:37.554
to fill in the fact
that it's a car.

00:07:37.554 --> 00:07:39.470
And you have the problem
when it starts moving

00:07:39.470 --> 00:07:40.605
of the aperture problem--

00:07:40.605 --> 00:07:42.230
that as you're moving,
and it's moving,

00:07:42.230 --> 00:07:44.990
it's very hard to tell and
deduce the true motion.

00:07:44.990 --> 00:07:48.250
Now, another thing that
happened was we had a threshold.

00:07:48.250 --> 00:07:51.200
And so in our
150,000 lines of code

00:07:51.200 --> 00:07:52.790
our wonderfully
gifted student, who's

00:07:52.790 --> 00:07:55.280
now a tenured professor
at Michigan, Ed Olson,

00:07:55.280 --> 00:07:57.500
had a threshold of
3 meters per second.

00:07:57.500 --> 00:08:00.110
So anything moving faster
than 3 meters per second

00:08:00.110 --> 00:08:01.400
could be a car.

00:08:01.400 --> 00:08:03.920
Anything less than 3 meters
per second couldn't be a car.
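
NOTE
A minimal sketch of the kind of speed threshold described above, showing why a stop-and-go vehicle slips through it. Only the 3 m/s value comes from the talk; the function name and the sample speeds are hypothetical, and this is not the team's actual code.
```python
CAR_SPEED_THRESHOLD_MPS = 3.0  # value quoted in the talk
def could_be_car(tracked_speed_mps: float) -> bool:
    """Treat a tracked obstacle as a possible car only above the threshold."""
    return tracked_speed_mps > CAR_SPEED_THRESHOLD_MPS
# Hypothetical speeds (m/s) for a vehicle that keeps stopping and starting.
stop_and_go = [0.0, 1.2, 2.5, 0.4, 0.0]
labels = [could_be_car(v) for v in stop_and_go]
print(labels)  # every sample is below the threshold, so none is labeled a car
```
A hard cutoff like this is simple to reason about, but it treats a slowly moving car the same as any static obstacle.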

00:08:03.920 --> 00:08:05.436
Now that might
seem kind of silly.

00:08:05.436 --> 00:08:07.310
But it turns out that
slowly moving obstacles

00:08:07.310 --> 00:08:08.960
are much harder to
detect and classify

00:08:08.960 --> 00:08:10.460
than fast moving obstacles.

00:08:10.460 --> 00:08:12.470
That's one reason that
city driving or driving,

00:08:12.470 --> 00:08:15.710
say, in a shopping mall parking
lot is actually in many ways

00:08:15.710 --> 00:08:17.990
more challenging than
driving on the highway.

00:08:17.990 --> 00:08:22.707
And so despite our best efforts
to stop at the last minute,

00:08:22.707 --> 00:08:25.040
we steered into the car and
had this little minor fender

00:08:25.040 --> 00:08:26.120
bender.

00:08:26.120 --> 00:08:28.490
But one thing that we did
is we made all our data

00:08:28.490 --> 00:08:29.480
available open source.

00:08:29.480 --> 00:08:31.105
And we actually wrote
a journal article

00:08:31.105 --> 00:08:35.159
on this incident
and a few others.

00:08:35.159 --> 00:08:37.520
And so if you'd asked
me then in 2007,

00:08:37.520 --> 00:08:39.720
I would have said we're
a long way from turning

00:08:39.720 --> 00:08:41.659
your car loose on
the streets of Boston

00:08:41.659 --> 00:08:43.580
with absolutely no user input.

00:08:43.580 --> 00:08:47.400
And the real challenge is
uncertainty and robustness

00:08:47.400 --> 00:08:50.410
and developing robust
systems that really work.

00:08:50.410 --> 00:08:53.240
But for our system, some of the
algorithm progress we made--

00:08:53.240 --> 00:08:54.860
I mentioned the lane tracking.

00:08:54.860 --> 00:08:59.270
Albert Huang, who's now, I think,
working at Google, developed--

00:08:59.270 --> 00:09:01.070
was given very sparse--

00:09:01.070 --> 00:09:03.532
I'd say about 10% of
the recent graduates

00:09:03.532 --> 00:09:05.240
or more are working
at Google these days.

00:09:05.240 --> 00:09:06.380
AUDIENCE: Albert's
at [INAUDIBLE]..

00:09:06.380 --> 00:09:07.120
JOHN LEONARD: Oh.

00:09:07.120 --> 00:09:08.090
OK.

00:09:08.090 --> 00:09:16.100
And then here is a video
for the qualifying event

00:09:16.100 --> 00:09:17.570
to get into the final race.

00:09:17.570 --> 00:09:19.790
We had to navigate-- whoops,
I can't press the mouse.

00:09:19.790 --> 00:09:20.690
That's going to stop.

00:09:20.690 --> 00:09:26.990
So we had to navigate
along a curved road

00:09:26.990 --> 00:09:28.340
with very sparse waypoints.

00:09:28.340 --> 00:09:30.022
And so, in real
time the computer

00:09:30.022 --> 00:09:31.730
has to make decisions
about what it sees.

00:09:31.730 --> 00:09:32.480
Where is the road?

00:09:32.480 --> 00:09:33.440
Where am I?

00:09:33.440 --> 00:09:34.602
Are there obstacles?

00:09:34.602 --> 00:09:36.560
And there are no parked
cars in this situation,

00:09:36.560 --> 00:09:38.360
but other stretches
had parked cars.

00:09:38.360 --> 00:09:41.021
And our car-- in a nutshell,
if our robot became

00:09:41.021 --> 00:09:43.020
confused about where the
road was it would stop.

00:09:43.020 --> 00:09:44.930
It would have to wait
and get its courage up,

00:09:44.930 --> 00:09:47.741
like lowering its
thresholds as it was stuck.

00:09:47.741 --> 00:09:49.490
But we were the only
team to our knowledge

00:09:49.490 --> 00:09:52.530
to qualify without
actually adding waypoints.

00:09:52.530 --> 00:09:54.030
So it turns out the
other top teams,

00:09:54.030 --> 00:09:56.300
they just went in with
a Google satellite image

00:09:56.300 --> 00:09:59.280
and just added a breadcrumb
trail for the robot to follow,

00:09:59.280 --> 00:10:01.100
simplifying the perception.

00:10:01.100 --> 00:10:02.970
So this was back in '07.

00:10:02.970 --> 00:10:05.700
Now let's fast forward to 2015.

00:10:05.700 --> 00:10:09.470
And right now-- so
of course, we have

00:10:09.470 --> 00:10:11.390
the Google self-driving
car which has just

00:10:11.390 --> 00:10:13.550
been an amazing project.

00:10:13.550 --> 00:10:15.590
And you've all probably
seen these videos,

00:10:15.590 --> 00:10:17.960
each with millions
of hits on YouTube.

00:10:17.960 --> 00:10:22.460
The earlier one of taking
a blind person for a ride

00:10:22.460 --> 00:10:26.090
to Taco Bell, this was
driving-- that was 2012, city

00:10:26.090 --> 00:10:30.080
streets in 2014, spring 2015.

00:10:30.080 --> 00:10:32.960
And then the new
Google car, which

00:10:32.960 --> 00:10:36.020
won't have a steering wheel
in its final instantiation,

00:10:36.020 --> 00:10:37.237
won't have pedals.

00:10:37.237 --> 00:10:38.570
It will just have a stop button.

00:10:38.570 --> 00:10:40.880
And that's your analogy
to the elevator.

00:10:40.880 --> 00:10:46.070
And so I think that the Google
car is an amazing research

00:10:46.070 --> 00:10:50.990
project that might one
day transform mobility.

00:10:50.990 --> 00:10:52.900
But I do think,
with all sincerity--

00:10:52.900 --> 00:10:55.310
so I rode in the
Google car last summer.

00:10:55.310 --> 00:10:56.670
I was blown away.

00:10:56.670 --> 00:10:58.580
I felt like I was on
the beach at Kitty Hawk.

00:10:58.580 --> 00:11:01.370
It's like this just
really profound technology

00:11:01.370 --> 00:11:03.920
that could in the long term
have a very big impact.

00:11:03.920 --> 00:11:05.720
And I have amazing
respect for that team--

00:11:05.720 --> 00:11:08.270
Chris Urmson, Mike
Montemerlo, et cetera.

00:11:08.270 --> 00:11:11.180
But I think in the media and
in others, the technology

00:11:11.180 --> 00:11:14.480
has been a bit overhyped, and
it's poorly understood.

00:11:14.480 --> 00:11:16.970
And a lot of it goes down to
how the car localizes itself,

00:11:16.970 --> 00:11:18.920
how it uses prior
maps, and how they

00:11:18.920 --> 00:11:21.080
simplify the task of driving.

00:11:21.080 --> 00:11:22.820
And so even though
people like Musk

00:11:22.820 --> 00:11:24.870
have said driving
is a solved problem,

00:11:24.870 --> 00:11:26.870
I think we have to be
aware that just because it

00:11:26.870 --> 00:11:30.200
works for Google, doesn't mean
it'll work for everybody else.

00:11:30.200 --> 00:11:33.740
So critical differences between
Google and, say, everyone else.

00:11:33.740 --> 00:11:35.720
And this is with all
respect to all players.

00:11:35.720 --> 00:11:36.886
I'm not trying to criticize.

00:11:36.886 --> 00:11:40.430
It's more just trying
to balance the debate.

00:11:40.430 --> 00:11:42.680
The Google car
localizes on the left

00:11:42.680 --> 00:11:45.440
with a prior map, where they
map the LiDAR intensity off

00:11:45.440 --> 00:11:47.030
of the ground surface.

00:11:47.030 --> 00:11:49.070
And they will annotate
the map by hand--

00:11:49.070 --> 00:11:52.097
adding pedestrian crossings,
adding stoplights.

00:11:52.097 --> 00:11:53.930
They'll drive a car
around many, many times,

00:11:53.930 --> 00:11:57.146
and then do a SLAM process
to optimize the map.

00:11:57.146 --> 00:11:58.520
But if the world
changes, they're

00:11:58.520 --> 00:11:59.811
going to have to adapt to that.

00:11:59.811 --> 00:12:03.200
Now, they've shown the ability
to respond to construction,

00:12:03.200 --> 00:12:04.772
bicyclists with hand signals.

00:12:04.772 --> 00:12:06.980
When I was in the car we
crossed the railroad tracks.

00:12:06.980 --> 00:12:07.938
That just blew me away.

00:12:07.938 --> 00:12:11.570
I mean, it's a pretty
impressive capability. But with

00:12:11.570 --> 00:12:14.330
a vision-based approach that
just follows the lane markings,

00:12:14.330 --> 00:12:16.890
if the lane markings are
good, everything's fine.

00:12:16.890 --> 00:12:19.035
In fact, Tesla either
just have released--

00:12:19.035 --> 00:12:21.410
or are about to release--
their autopilot software, which

00:12:21.410 --> 00:12:23.480
is an advanced lane
keeping system.

00:12:23.480 --> 00:12:26.090
And Elon Musk, a few weeks
ago, posted on Twitter

00:12:26.090 --> 00:12:29.240
that there's one last
corner case for us to fix.

00:12:29.240 --> 00:12:32.330
And apparently he-- on part of
his commute in the Los Angeles

00:12:32.330 --> 00:12:34.940
area there are
well-defined lane markings.

00:12:34.940 --> 00:12:37.790
And part of it is a concrete
road with weeds and skid marks

00:12:37.790 --> 00:12:38.720
and so forth.

00:12:38.720 --> 00:12:41.630
And he said publicly that the
system works well if the lane

00:12:41.630 --> 00:12:43.040
markings are well-defined.

00:12:43.040 --> 00:12:45.140
But for more challenging
vision conditions

00:12:45.140 --> 00:12:47.690
like looking into the sun
it doesn't work as well.

00:12:47.690 --> 00:12:49.910
And so the critical
difference is

00:12:49.910 --> 00:12:52.640
if you're going to use
the LiDAR with prior maps,

00:12:52.640 --> 00:12:54.740
you can do very precise
localization down

00:12:54.740 --> 00:12:57.200
to less than 10
centimeters accuracy.

00:12:57.200 --> 00:13:00.380
And the way I think about
it is robot navigation

00:13:00.380 --> 00:13:02.570
is about three things--

00:13:02.570 --> 00:13:04.790
where do you want
the robot to be?

00:13:04.790 --> 00:13:06.770
Where does the
robot think it is?

00:13:06.770 --> 00:13:08.810
And where really is the robot?

00:13:08.810 --> 00:13:13.010
And when the robot
thinks it's somewhere,

00:13:13.010 --> 00:13:15.559
but it's really somewhere
different, that's really bad.
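
NOTE
A small sketch of the three quantities named above: where we want the robot, where it thinks it is, and where it really is. The coordinates and variable names are hypothetical and chosen only to show how a robot can believe it is on target while being meters away.
```python
import math
def dist(a, b):
    """Euclidean distance between two (x, y) points in meters."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
desired = (10.0, 0.0)    # where we want the robot to be
estimated = (10.0, 0.0)  # where the robot thinks it is
actual = (10.0, 4.0)     # where the robot really is
control_error = dist(desired, estimated)      # the only error the robot can observe
localization_error = dist(estimated, actual)  # invisible to the robot itself
print(control_error, localization_error)
```
Here the control error is zero while the localization error is 4 meters, which is the dangerous case described in the talk.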

00:13:15.559 --> 00:13:16.100
That happens.

00:13:16.100 --> 00:13:19.010
We've lost underwater vehicles
and had very nervous searches

00:13:19.010 --> 00:13:20.090
to find them--

00:13:20.090 --> 00:13:23.780
luckily-- when the
robot made a mistake.

00:13:23.780 --> 00:13:26.150
And so with the Google
approach they really

00:13:26.150 --> 00:13:29.000
nail this "where am I" problem--
the localization problem.

00:13:29.000 --> 00:13:30.871
But it means having
an expensive LiDAR.

00:13:30.871 --> 00:13:32.120
It means having accurate maps.

00:13:32.120 --> 00:13:33.740
It means maintaining them.

00:13:33.740 --> 00:13:36.170
One critical distinction
is between level four

00:13:36.170 --> 00:13:37.220
and level three.

00:13:37.220 --> 00:13:38.660
These are definitions
of autonomy

00:13:38.660 --> 00:13:40.970
from the US
government-- from NHTSA.

00:13:40.970 --> 00:13:42.950
A level four car
is what Google are

00:13:42.950 --> 00:13:44.780
trying to do now,
which is really,

00:13:44.780 --> 00:13:46.460
you just-- you
could go to sleep.

00:13:46.460 --> 00:13:48.530
The car has 100% control.

00:13:48.530 --> 00:13:50.350
You couldn't intervene
if you wanted to.

00:13:50.350 --> 00:13:51.350
You just press a button.

00:13:51.350 --> 00:13:51.850
Go to sleep.

00:13:51.850 --> 00:13:53.480
Wake up at your destination.

00:13:53.480 --> 00:13:56.120
Musk has said that he
thinks within five years

00:13:56.120 --> 00:13:59.450
you can go to sleep in your
car, which to me I just--

00:13:59.450 --> 00:14:01.670
five decades would
impress me, to be honest.

00:14:04.710 --> 00:14:08.330
But level three is when the car
is going to do most of the job,

00:14:08.330 --> 00:14:10.880
but you have to take over
if something goes wrong.

00:14:10.880 --> 00:14:15.290
And for example Delphi drove
99% of the way across the US

00:14:15.290 --> 00:14:18.200
in spring of this year,
which is pretty impressive.

00:14:18.200 --> 00:14:20.660
But 50 miles had to
be driven by people--

00:14:20.660 --> 00:14:23.027
getting on and off of
highways and city streets.

00:14:23.027 --> 00:14:24.860
And so there's something
about human nature,

00:14:24.860 --> 00:14:27.110
and the way humans interact
with autonomous systems,

00:14:27.110 --> 00:14:31.190
that it's actually kind of hard
for a person to pay attention.

00:14:31.190 --> 00:14:36.470
Imagine if 99% of the time
the car does it perfectly.

00:14:36.470 --> 00:14:38.780
But 1% of the time it's
about to make a mistake,

00:14:38.780 --> 00:14:41.510
and you have to be
alert to take over.

00:14:41.510 --> 00:14:44.240
And research experience
from aviation

00:14:44.240 --> 00:14:46.400
has shown that humans
are actually bad at that.

00:14:46.400 --> 00:14:49.750
And another issue
is-- and this is--

00:14:49.750 --> 00:14:52.400
I mean, Mountain View is pretty
complicated-- lots of cyclists,

00:14:52.400 --> 00:14:54.590
pedestrians, I mentioned
the railroad crossings,

00:14:54.590 --> 00:14:55.930
construction.

00:14:55.930 --> 00:14:58.790
But in California they've
had this historic drought.

00:14:58.790 --> 00:15:02.340
And most of the testing has been
done with no rain, for example,

00:15:02.340 --> 00:15:03.670
and no snow.

00:15:03.670 --> 00:15:06.370
And if you think about
Boston and Boston roads,

00:15:06.370 --> 00:15:08.540
there are some pretty
challenging situations.

00:15:08.540 --> 00:15:11.500
And so for myself, when I
first-- a couple of years ago

00:15:11.500 --> 00:15:14.020
I said I didn't expect
a taxi in Manhattan

00:15:14.020 --> 00:15:16.090
in my lifetime-- a
fully autonomous taxi--

00:15:16.090 --> 00:15:17.470
to go anywhere in Manhattan.

00:15:17.470 --> 00:15:19.570
And I got criticized
online for saying that.

00:15:19.570 --> 00:15:24.160
So I put a dash cam on
my car, and actually

00:15:24.160 --> 00:15:27.160
had my son record
cell phone footage.

00:15:27.160 --> 00:15:28.870
The upper left is
making a left turn

00:15:28.870 --> 00:15:30.929
near my house in Newton, Mass.

00:15:30.929 --> 00:15:32.470
And if you look to
the right, there's

00:15:32.470 --> 00:15:34.600
cars as far as the eye can see.

00:15:34.600 --> 00:15:36.100
And if you look to
the left, there's

00:15:36.100 --> 00:15:38.650
cars coming at pretty high
rate of speed, with a mailbox,

00:15:38.650 --> 00:15:39.730
and a tree.

00:15:39.730 --> 00:15:44.590
And this is a really challenging
behavior for a human,

00:15:44.590 --> 00:15:47.620
because it requires making
a decision in real time.

00:15:47.620 --> 00:15:50.140
We want very high reliability
in terms of detecting

00:15:50.140 --> 00:15:51.730
the cars coming from the left.

00:15:51.730 --> 00:15:53.560
But the way that I
pulled out is to wave

00:15:53.560 --> 00:15:55.210
at a person in another car.

00:15:55.210 --> 00:15:58.140
And those sort of
nods and waves--

00:15:58.140 --> 00:15:59.890
they're some of the
most challenging forms

00:15:59.890 --> 00:16:02.596
of human-computer interaction.

00:16:02.596 --> 00:16:03.970
So imagine vision
algorithms that

00:16:03.970 --> 00:16:08.260
could detect a person nodding
at you from the other direction.

00:16:08.260 --> 00:16:09.730
Or here's another situation.

00:16:09.730 --> 00:16:12.645
This is going through
Coolidge Corner in Brookline.

00:16:12.645 --> 00:16:14.770
And I'll show a longer
version of this in a second.

00:16:14.770 --> 00:16:15.820
But the light's green.

00:16:15.820 --> 00:16:17.836
And see here-- this
police officer?

00:16:17.836 --> 00:16:19.960
So despite the green light,
the police officer just

00:16:19.960 --> 00:16:23.350
raises their hand, and that
is the signal to stop.

00:16:23.350 --> 00:16:27.190
And so interacting with
crossing guards and people--

00:16:27.190 --> 00:16:29.290
very challenging,
as well as changes

00:16:29.290 --> 00:16:32.460
to the road surface and,
of course, adverse weather.

00:16:32.460 --> 00:16:36.401
And so here's a longer sequence
for that police officer.

00:16:36.401 --> 00:16:38.650
First of all, you'll see
flashing lights on the left--

00:16:38.650 --> 00:16:40.829
which maybe, with flashing
lights, you should pull over.

00:16:40.829 --> 00:16:42.370
Here you should just
drive past them.

00:16:42.370 --> 00:16:43.744
It's just the cop
left his lights

00:16:43.744 --> 00:16:45.400
on when he parked his car.

00:16:45.400 --> 00:16:46.780
But the light's red.

00:16:46.780 --> 00:16:48.550
And this police
officer is waving me

00:16:48.550 --> 00:16:50.380
through a red
light, which I think

00:16:50.380 --> 00:16:51.640
is a really advanced behavior.

00:16:51.640 --> 00:16:53.740
So imagine a car that's--

00:16:53.740 --> 00:16:56.710
imagine the logic for
OK, stop at red lights

00:16:56.710 --> 00:16:59.200
unless there's a police
officer waving you through it,

00:16:59.200 --> 00:17:00.490
and how you get that reliable.

00:17:00.490 --> 00:17:02.823
And now we're going to pull
up to the next intersection,

00:17:02.823 --> 00:17:06.290
and this police officer is
going to stop us at a green light.

00:17:06.290 --> 00:17:08.710
And so despite all the
recent progress in vision,

00:17:08.710 --> 00:17:11.260
things like image
labeling, ImageNet--

00:17:11.260 --> 00:17:15.310
most of those systems are
trained with vast archives

00:17:15.310 --> 00:17:18.369
of images from the internet
where there's no context.

00:17:18.369 --> 00:17:21.780
And they're so challenging
for even humans to classify.

00:17:21.780 --> 00:17:24.490
So that if you had
some data sets,

00:17:24.490 --> 00:17:26.230
like the Caltech
pedestrian data set,

00:17:26.230 --> 00:17:29.830
if you got 78% performance,
that's really good.

00:17:29.830 --> 00:17:34.450
But we need 99.9999%
or better performance

00:17:34.450 --> 00:17:36.400
before we're going
to turn cars loose

00:17:36.400 --> 00:17:39.820
in the wild in these
challenging situations.
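
NOTE
A rough sketch of the gap between the two figures quoted above. It treats both percentages as simple per-decision success rates, which is a simplifying assumption (benchmark scores are not defined that way in general), and the function name is hypothetical.
```python
def failures_per_million(success_pct: float) -> float:
    """Expected failures in one million independent decisions."""
    return (100.0 - success_pct) / 100.0 * 1_000_000
benchmark = failures_per_million(78.0)    # figure quoted for a pedestrian data set
required = failures_per_million(99.9999)  # reliability target quoted in the talk
print(round(benchmark), round(required))
```
Under that assumption the target is several orders of magnitude stricter than what counted as a good benchmark result.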

00:17:39.820 --> 00:17:42.689
Now going back more to
localization and mapping.

00:17:42.689 --> 00:17:44.230
Here I collected
data for about three

00:17:44.230 --> 00:17:46.672
or four weeks of my commuting.

00:17:46.672 --> 00:17:47.755
This is crossing the Mass.

00:17:47.755 --> 00:17:51.040
Ave. Bridge going from
Boston into Cambridge.

00:17:51.040 --> 00:17:52.540
And the lighting
is a little tricky.

00:17:52.540 --> 00:17:54.760
But tell me what's
different between the top

00:17:54.760 --> 00:17:56.480
and the bottom video.

00:17:59.310 --> 00:18:02.560
And notice, by the way, how
close we come to this truck.

00:18:02.560 --> 00:18:04.900
With the slightest angular error
in your position estimate,

00:18:04.900 --> 00:18:07.670
really bad things could happen.

00:18:07.670 --> 00:18:10.300
But the top-- this
is a long weekend.

00:18:10.300 --> 00:18:11.784
This is Veterans Day weekend.

00:18:11.784 --> 00:18:12.700
They repaved the Mass.

00:18:12.700 --> 00:18:13.200
Ave. Bridge.

00:18:13.200 --> 00:18:15.970
So on the bottom, the
lane lines are gone.

00:18:15.970 --> 00:18:18.430
And so if you had an
appearance-based localization

00:18:18.430 --> 00:18:20.140
algorithm like
Google's, you would

00:18:20.140 --> 00:18:22.629
need to remap the bridge
before you drove on it.

00:18:22.629 --> 00:18:23.920
But the lines aren't there yet.

00:18:23.920 --> 00:18:25.550
And how well is
it going to work?

00:18:25.550 --> 00:18:28.490
And so, this is just a
really tricky situation.

00:18:28.490 --> 00:18:30.140
And, of course, there's weather.

00:18:30.140 --> 00:18:32.290
Now, snow is
difficult for things

00:18:32.290 --> 00:18:33.910
like traction and control.

00:18:33.910 --> 00:18:36.730
But for perception, if you look
at how the Google car actually

00:18:36.730 --> 00:18:37.630
works--

00:18:37.630 --> 00:18:39.130
if you're going to
localize yourself

00:18:39.130 --> 00:18:44.080
based on precisely knowing
the car's position down

00:18:44.080 --> 00:18:47.590
to centimeters so that you can
predict what you should see,

00:18:47.590 --> 00:18:49.224
then if you can't
see the road surface

00:18:49.224 --> 00:18:50.890
you're not going to
be able to localize.

00:18:50.890 --> 00:18:53.770
And so this is just a
reminder of the sorts of maps

00:18:53.770 --> 00:18:55.420
that Google uses.

00:18:55.420 --> 00:18:58.510
So I think to make it to really
challenging weather and very

00:18:58.510 --> 00:19:00.939
complex environments, we need
a higher level understanding

00:19:00.939 --> 00:19:01.480
of the world.

00:19:01.480 --> 00:19:04.450
I think more a semantic or
object-based understanding

00:19:04.450 --> 00:19:05.660
of the world.

00:19:05.660 --> 00:19:08.680
And then, of course, there's
difficulties in perception.

00:19:08.680 --> 00:19:11.400
And so what do you
see in this picture?

00:19:16.010 --> 00:19:17.800
The sun?

00:19:17.800 --> 00:19:19.326
There's a green light there.

00:19:19.326 --> 00:19:20.950
I realize the lighting
is really harsh,

00:19:20.950 --> 00:19:22.990
and maybe you could do
polarization or something

00:19:22.990 --> 00:19:23.490
better.

00:19:23.490 --> 00:19:27.220
But does anyone see the
traffic cop standing there?

00:19:27.220 --> 00:19:29.200
You can just make out his legs.

00:19:29.200 --> 00:19:32.230
There's a policeman there
who gave me this little wave,

00:19:32.230 --> 00:19:34.300
even though I was sort
of blinded by the sun.

00:19:34.300 --> 00:19:35.560
And he walked out and
put his back to me

00:19:35.560 --> 00:19:37.060
and was waving
pedestrians across,

00:19:37.060 --> 00:19:38.540
even though the light was green.

00:19:38.540 --> 00:19:41.020
So a purely
vision-based system is

00:19:41.020 --> 00:19:45.980
going to just need dramatic
leaps in visual performance.

00:19:45.980 --> 00:19:48.250
So to wrap up the
self-driving car part,

00:19:48.250 --> 00:19:51.550
I think the big questions going
forward-- technical challenges,

00:19:51.550 --> 00:19:55.990
maintaining the maps,
dealing with adverse weather,

00:19:55.990 --> 00:19:57.280
interacting with people--

00:19:57.280 --> 00:19:59.740
both inside and
outside of the car--

00:19:59.740 --> 00:20:02.860
and then getting truly robust
computer vision algorithms.

00:20:02.860 --> 00:20:05.440
We want to get in a totally
different place on the ROC

00:20:05.440 --> 00:20:07.750
curves, or the
precision recall curves,

00:20:07.750 --> 00:20:11.830
where we're approaching perfect
detection with no false alarms.

00:20:11.830 --> 00:20:14.930
And that's a really
hard thing to do.

00:20:14.930 --> 00:20:16.990
So I've worked my whole
life on the robot mapping

00:20:16.990 --> 00:20:18.160
and localization problem.

00:20:18.160 --> 00:20:19.960
And for this audience
I wanted to just

00:20:19.960 --> 00:20:22.800
ask you a little question.

00:20:22.800 --> 00:20:25.920
Does anyone know what the
2014 Nobel Prize in medicine

00:20:25.920 --> 00:20:28.701
or physiology was for?

00:20:28.701 --> 00:20:29.200
Anybody?

00:20:29.200 --> 00:20:30.755
AUDIENCE: [INAUDIBLE]

00:20:30.755 --> 00:20:31.630
AUDIENCE: Grid cells.

00:20:31.630 --> 00:20:32.490
JOHN LEONARD: Grid cells.

00:20:32.490 --> 00:20:33.770
Grid cells and place cells.

00:20:33.770 --> 00:20:38.030
And so this has been
called SLAM in the brain.

00:20:38.030 --> 00:20:38.950
Now, you might argue.

00:20:38.950 --> 00:20:41.920
And we might be very
far from knowing.

00:20:41.920 --> 00:20:44.474
But I think it's just
really exciting to--

00:20:44.474 --> 00:20:45.640
so for myself, I'll explain.

00:20:45.640 --> 00:20:47.680
I've had what's called
an ONR MURI grant--

00:20:47.680 --> 00:20:50.610
multidisciplinary university
research initiative grant--

00:20:50.610 --> 00:20:52.510
with Mike Hasselmo
and his colleagues

00:20:52.510 --> 00:20:53.980
at Boston University.

00:20:53.980 --> 00:20:56.560
And these are a couple
of Mike's videos.

00:20:56.560 --> 00:20:59.650
And so, I think Matt
Wilson spoke to your group.

00:20:59.650 --> 00:21:04.720
And the notion that in
the entorhinal cortex

00:21:04.720 --> 00:21:06.994
that there is this sort
of position information

00:21:06.994 --> 00:21:08.410
that's very metrical,
and it seems

00:21:08.410 --> 00:21:10.390
to be at the heart
of memory formation,

00:21:10.390 --> 00:21:14.150
to me is very powerful
and very important.

00:21:14.150 --> 00:21:17.570
And so, we have this underlying
question of representation.

00:21:17.570 --> 00:21:18.940
How do we represent the world?

00:21:18.940 --> 00:21:24.130
And I believe location
is just absolutely

00:21:24.130 --> 00:21:27.057
vital to building
memories and to developing

00:21:27.057 --> 00:21:28.390
advanced reasoning in the world.

00:21:28.390 --> 00:21:31.720
And the fact that
grid cells exist--

00:21:31.720 --> 00:21:34.240
to me-- and they have this
role in memory formation

00:21:34.240 --> 00:21:37.390
is just this really
exciting concept.

00:21:37.390 --> 00:21:41.800
And so, in robotics
we call the problem

00:21:41.800 --> 00:21:44.710
of how a robot builds a map
and uses that map to navigate,

00:21:44.710 --> 00:21:47.270
SLAM-- simultaneous
localization and mapping.

00:21:47.270 --> 00:21:50.470
This is for a PR2 robot being
driven around the second floor

00:21:50.470 --> 00:21:52.420
of our building, not far
from Patrick's office

00:21:52.420 --> 00:21:54.430
if you recognize any of that.

00:21:54.430 --> 00:21:59.110
And this is using stereo vision.

00:21:59.110 --> 00:22:01.000
My PhD student,
Hordur Johannsson,

00:22:01.000 --> 00:22:02.620
who graduated a
couple of years ago,

00:22:02.620 --> 00:22:05.110
created a system to
do real time SLAM

00:22:05.110 --> 00:22:06.790
and try to address
how to get temporally

00:22:06.790 --> 00:22:08.829
scalable representations.

00:22:08.829 --> 00:22:10.870
And one thing you'll see
as the robot goes around

00:22:10.870 --> 00:22:12.340
occasionally is
loop closing, where

00:22:12.340 --> 00:22:14.381
the robot might come back
and have like, an error

00:22:14.381 --> 00:22:15.890
and then correct that error.

00:22:15.890 --> 00:22:18.370
So this is the part of the
SLAM problem that in some ways

00:22:18.370 --> 00:22:20.050
is well understood
in robotics, which

00:22:20.050 --> 00:22:25.390
is how you detect features from
images, track them over time,

00:22:25.390 --> 00:22:28.480
and try to bootstrap up,
building a representation

00:22:28.480 --> 00:22:30.805
and using that to
estimate your location.

00:22:30.805 --> 00:22:33.370
And I've worked on
this my whole career.

00:22:33.370 --> 00:22:38.020
And as a grad student at Oxford,
I had very primitive sensors.

00:22:38.020 --> 00:22:41.140
So for a historical SLAM talk I
recently digitized an old video

00:22:41.140 --> 00:22:42.650
and some old pictures.

00:22:42.650 --> 00:22:46.020
This was in the basement of the
engineering building at Oxford.

00:22:46.020 --> 00:22:49.369
This is just the localization
part of how you have a map,

00:22:49.369 --> 00:22:51.160
and you generate
predictions-- in this case

00:22:51.160 --> 00:22:53.020
for sonar measurements.

00:22:53.020 --> 00:22:55.620
And at the time there we had--

00:22:55.620 --> 00:22:57.190
I'm sitting at a
Sun workstation.

00:22:57.190 --> 00:22:59.320
To my left is something
called a Datacube,

00:22:59.320 --> 00:23:02.320
which for about $100,000
could just barely

00:23:02.320 --> 00:23:07.180
do like real time frame grabbing
and then edge detection out.

00:23:07.180 --> 00:23:09.730
And so vision just wasn't ready.

00:23:09.730 --> 00:23:12.157
And the exciting
thing now in our field

00:23:12.157 --> 00:23:13.990
is vision is ready--
that we're really using

00:23:13.990 --> 00:23:15.622
vision in a substantial way.

00:23:15.622 --> 00:23:17.080
But I think a lot
about prediction.

00:23:17.080 --> 00:23:18.850
If you know your
position, you can

00:23:18.850 --> 00:23:21.430
predict what you should see
and create a feedback loop.

00:23:21.430 --> 00:23:23.310
And that's sort of what
we're trying to do.

00:23:23.310 --> 00:23:27.760
And so SLAM is a
wonderful problem,

00:23:27.760 --> 00:23:30.790
I believe, for addressing a
whole great set of questions,

00:23:30.790 --> 00:23:32.980
because there are these
different axes of difficulty

00:23:32.980 --> 00:23:34.940
that interact with one another.

00:23:34.940 --> 00:23:36.210
And one is representation.

00:23:36.210 --> 00:23:37.460
How do we represent the world?

00:23:37.460 --> 00:23:39.280
And I think that
question-- we still have

00:23:39.280 --> 00:23:41.330
a ton of things to think about.

00:23:41.330 --> 00:23:42.310
Another is inference.

00:23:42.310 --> 00:23:43.690
We want to do real
time inference

00:23:43.690 --> 00:23:45.910
about what's where in the
world and how we combine it

00:23:45.910 --> 00:23:47.410
all together.

00:23:47.410 --> 00:23:49.870
And finally, there's a systems
and autonomy axis, where

00:23:49.870 --> 00:23:52.030
we want to build systems,
and deploy them, and have

00:23:52.030 --> 00:23:55.900
them operate robustly and
reliably in the world.

00:23:55.900 --> 00:23:59.920
So in SLAM, here's an
example of how we pose

00:23:59.920 --> 00:24:01.720
this as an inference problem.

00:24:01.720 --> 00:24:05.340
This is from the classic
Victoria Park data

00:24:05.340 --> 00:24:07.600
set from Sydney University.

00:24:07.600 --> 00:24:10.720
A robot drives around, in this
case, a park with some trees.

00:24:10.720 --> 00:24:12.800
There are landmarks
shown in green.

00:24:12.800 --> 00:24:14.622
The robot's position
drifts over time.

00:24:14.622 --> 00:24:15.830
We have dead reckoning error.

00:24:15.830 --> 00:24:17.820
That's shown in blue.

00:24:17.820 --> 00:24:20.840
And we estimate the trajectory
of the robot in red,

00:24:20.840 --> 00:24:22.250
and the position
of the landmarks

00:24:22.250 --> 00:24:23.349
from relative measurements.

00:24:23.349 --> 00:24:24.890
So as you take
relative measurements,

00:24:24.890 --> 00:24:26.348
and you move through
the world, how

00:24:26.348 --> 00:24:27.910
do you put that all together?

00:24:27.910 --> 00:24:29.990
And so we cast this
as an inference problem

00:24:29.990 --> 00:24:33.680
where we have the robot
poses, the odometric inputs,

00:24:33.680 --> 00:24:36.350
landmarks-- you can do it
with or without landmarks--

00:24:36.350 --> 00:24:37.450
and measurements.
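
NOTE
Editorial sketch, not part of the talk: a toy 1D version of SLAM posed as
inference over poses, odometry, a landmark, and measurements. It assumes linear
constraints with equal weights and solves them by least squares. All numbers and
the solve_least_squares helper are hypothetical, and real systems use nonlinear,
sparse solvers rather than this dense elimination.
```python
# Unknowns: three robot poses x0, x1, x2 and one landmark l, all on a line (1D).
# Each row of A and entry of b encodes one linear constraint A @ state = b.
A = [
    [1.0, 0.0, 0.0, 0.0],    # prior: x0 = 0
    [-1.0, 1.0, 0.0, 0.0],   # odometry: x1 - x0 = 1.1 (drifting)
    [0.0, -1.0, 1.0, 0.0],   # odometry: x2 - x1 = 1.1 (drifting)
    [-1.0, 0.0, 0.0, 1.0],   # landmark seen from x0: l - x0 = 3.0
    [0.0, 0.0, -1.0, 1.0],   # landmark seen from x2: l - x2 = 1.0
]
b = [0.0, 1.1, 1.1, 3.0, 1.0]
def solve_least_squares(A, b):
    """Solve min ||A x - b||^2 via the normal equations and Gaussian elimination."""
    n = len(A[0])
    # Augmented normal equations [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]
x0, x1, x2, landmark = solve_least_squares(A, b)
print([round(v, 3) for v in (x0, x1, x2, landmark)])
```
Dead reckoning alone would place x2 at 2.2. The two landmark sightings pull the
estimate back, spreading the disagreement evenly across the equally weighted constraints.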

00:24:37.450 --> 00:24:40.280
And an interesting thing-- so
we have this inference problem

00:24:40.280 --> 00:24:41.700
on a belief network.

00:24:41.700 --> 00:24:44.150
The key thing about SLAM is
it's building up over time.

00:24:44.150 --> 00:24:46.280
So you start with nothing
and the problem's growing

00:24:46.280 --> 00:24:47.480
ever larger.

00:24:47.480 --> 00:24:50.450
And, let's see,
if I had to say--

00:24:50.450 --> 00:24:53.060
25 years of thinking about
this up through 2012,

00:24:53.060 --> 00:24:55.130
the most important
thing I learned

00:24:55.130 --> 00:24:58.520
is that maintaining sparsity in
the underlying representation

00:24:58.520 --> 00:24:59.150
is critical.

00:24:59.150 --> 00:25:00.650
And, in fact, for
biological systems

00:25:00.650 --> 00:25:03.290
I wonder if there is
evidence of sparsity.

00:25:03.290 --> 00:25:05.820
Because sparsity is the
key to doing efficient

00:25:05.820 --> 00:25:08.090
inference when you
pose this problem.
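
NOTE
Editorial sketch, not part of the talk: a small illustration of why sparsity
matters, assuming 1D poses linked by unit-weight relative constraints. Each
constraint touches only two poses, so the information matrix (A^T A) stays
mostly zeros. The function names and the six-pose example are hypothetical.
```python
def chain_information_matrix(num_poses, loop_closures=()):
    """Information matrix (A^T A) for 1D poses linked by unit-weight relative constraints."""
    info = [[0.0] * num_poses for _ in range(num_poses)]
    info[0][0] += 1.0  # prior that anchors the first pose
    odometry = [(i, i + 1) for i in range(num_poses - 1)]
    for i, j in odometry + list(loop_closures):
        # A constraint x_j - x_i = z touches only the (i, j) block of the matrix.
        info[i][i] += 1.0
        info[j][j] += 1.0
        info[i][j] -= 1.0
        info[j][i] -= 1.0
    return info
def count_nonzeros(matrix):
    """Count the entries of a matrix that are not exactly zero."""
    return sum(1 for row in matrix for value in row if value != 0.0)
odometry_only = count_nonzeros(chain_information_matrix(6))
with_loop = count_nonzeros(chain_information_matrix(6, loop_closures=[(0, 5)]))
print(odometry_only, with_loop, 6 * 6)
```
With odometry only, the matrix is tridiagonal, and a loop closure adds just two
off-diagonal entries, so the nonzero count grows with the number of constraints
rather than with the square of the number of poses.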

00:25:08.090 --> 00:25:10.100
And so many algorithms
have basically

00:25:10.100 --> 00:25:13.160
boiled down to maintaining
sparsity in the underlying

00:25:13.160 --> 00:25:14.579
representations.

00:25:14.579 --> 00:25:16.370
So just briefly, the
most important thing I

00:25:16.370 --> 00:25:20.440
learned since then in
the last few years--

00:25:20.440 --> 00:25:23.640
I'm really excited by building
dense representations.

00:25:23.640 --> 00:25:27.290
So this is work in collaboration
with some folks in Ireland--

00:25:27.290 --> 00:25:29.840
Tom Whelan, John McDonald--
building on KinectFusion from

00:25:29.840 --> 00:25:32.570
Richard Newcombe
and Andrew Davison--

00:25:32.570 --> 00:25:36.080
how you can use a GPU to build
a volumetric representation,

00:25:36.080 --> 00:25:39.085
and build rich, dense models,
and estimate your motion as you

00:25:39.085 --> 00:25:39.960
go through the world.

00:25:39.960 --> 00:25:42.680
So this is something we
call continuous or spatially

00:25:42.680 --> 00:25:44.449
extended KinectFusion.

00:25:44.449 --> 00:25:46.240
This little video here
from three years ago

00:25:46.240 --> 00:25:49.100
is going on in an
apartment in Ireland.

00:25:49.100 --> 00:25:50.960
And I'll show you
the end result.

00:25:50.960 --> 00:25:53.570
Just hand-carrying a
sensor through the world--

00:25:53.570 --> 00:25:56.060
and you can see the quality
of the reconstructions

00:25:56.060 --> 00:25:57.590
you can build, say,
in the bathroom,

00:25:57.590 --> 00:26:02.216
the sink, the tub, the stairs,
to have really rich 3D models

00:26:02.216 --> 00:26:04.340
that we can build and then
enable the more advanced

00:26:04.340 --> 00:26:06.080
interactions that Russ showed.

00:26:06.080 --> 00:26:07.380
That's fantastic.

00:26:07.380 --> 00:26:08.970
And I mentioned loop closing--

00:26:08.970 --> 00:26:10.386
something we did
a couple of years

00:26:10.386 --> 00:26:13.100
ago was adding loop closing to
these dense representations.

00:26:13.100 --> 00:26:16.340
So this is-- again, in
CSAIL-- this is walking around

00:26:16.340 --> 00:26:19.880
the Stata Center with
about eight minutes of data

00:26:19.880 --> 00:26:21.500
going up and down stairs.

00:26:21.500 --> 00:26:26.180
If you watch the two blue chairs
near Randy Davis's office,

00:26:26.180 --> 00:26:28.010
you can see how they
get locked into place

00:26:28.010 --> 00:26:29.640
as you correct the error.

00:26:29.640 --> 00:26:32.210
So this is taking mesh
deformation techniques

00:26:32.210 --> 00:26:33.650
from graphics and combining it.

00:26:33.650 --> 00:26:35.960
So the underlying pose
graph representation

00:26:35.960 --> 00:26:38.390
is like a foundation or
a skeleton on which you

00:26:38.390 --> 00:26:41.870
build the rich representation.

00:26:41.870 --> 00:26:42.500
OK.

00:26:42.500 --> 00:26:46.305
So this is the
end resulting map.

00:26:46.305 --> 00:26:48.680
And there's been some really
exciting work just this year

00:26:48.680 --> 00:26:51.560
from Whelan and from Newcombe
in this space of doing

00:26:51.560 --> 00:26:55.520
deformable objects, and
then really scalable

00:26:55.520 --> 00:26:57.740
algorithms where you can
sort of paint the world.

00:26:57.740 --> 00:26:59.365
So the final thing
I want to talk about

00:26:59.365 --> 00:27:01.370
in my last few minutes
is our latest work

00:27:01.370 --> 00:27:04.280
of using object-based
representations.

00:27:04.280 --> 00:27:06.440
And for this audience,
I think if you go back

00:27:06.440 --> 00:27:09.260
to David Marr, who I
feel is unappreciated

00:27:09.260 --> 00:27:12.470
in the historical
sense of how I feel,

00:27:12.470 --> 00:27:14.800
that vision is the
process of discovering

00:27:14.800 --> 00:27:18.500
from images what is present
in the world and where it is.

00:27:18.500 --> 00:27:21.890
And to me, the what
and where are coupled.

00:27:21.890 --> 00:27:23.360
And maybe that's
been lost a bit.

00:27:23.360 --> 00:27:26.000
And I think that's one
way in which robotics

00:27:26.000 --> 00:27:29.570
can help, I think, with
vision and brain sciences.

00:27:29.570 --> 00:27:33.170
I think we need to develop
object-based understanding

00:27:33.170 --> 00:27:33.714
of the world.

00:27:33.714 --> 00:27:35.630
So instead of just having
representations that

00:27:35.630 --> 00:27:38.600
are a massive amount of
points or purely appearance,

00:27:38.600 --> 00:27:40.630
where we can start
to build higher level

00:27:40.630 --> 00:27:42.830
and symbolic understanding
of the world.

00:27:42.830 --> 00:27:45.680
And so I want to build
rich representations

00:27:45.680 --> 00:27:48.110
that leverage knowledge
of your location

00:27:48.110 --> 00:27:50.570
to better understand where
objects are and knowledge

00:27:50.570 --> 00:27:53.060
about objects to better
understand your location.

00:27:53.060 --> 00:27:57.070
And just as a step in that
direction, my student, Sudeep

00:27:57.070 --> 00:28:00.050
Pillai, who was one
of Seth's students,

00:28:00.050 --> 00:28:05.180
has an RSS paper where we
looked at coupling using SLAM

00:28:05.180 --> 00:28:09.500
to get better object recognition
by effectively-- so here's

00:28:09.500 --> 00:28:13.546
an example of an input data
stream from Dieter Fox's group.

00:28:13.546 --> 00:28:15.170
There's just some
objects on the table.

00:28:15.170 --> 00:28:17.510
I realize it's a relatively
uncluttered scene.

00:28:17.510 --> 00:28:20.630
But this has been a benchmark
for RGB-D perception.

00:28:20.630 --> 00:28:22.730
And so, if you
combine data as you

00:28:22.730 --> 00:28:26.540
move through the world
using a SLAM system to do

00:28:26.540 --> 00:28:29.600
3D reconstruction on the
scene, and then using

00:28:29.600 --> 00:28:32.960
the reconstructed points to
help improve the prediction

00:28:32.960 --> 00:28:35.330
process for object
recognition, it

00:28:35.330 --> 00:28:41.427
leads to a more scalable
system for recognizing objects.

00:28:41.427 --> 00:28:43.010
And it comes back
to this notion to me

00:28:43.010 --> 00:28:45.530
that a big part of perception
is prediction-- the ability

00:28:45.530 --> 00:28:48.162
to predict what you see
from a given location.

00:28:48.162 --> 00:28:50.120
And so what we're doing
is we're leveraging off

00:28:50.120 --> 00:28:54.140
techniques in object detection,
feature encoding, and the newer

00:28:54.140 --> 00:28:55.670
SLAM algorithms,
and particularly

00:28:55.670 --> 00:28:59.290
the semi-dense ORB-SLAM
technique from Zaragoza, Spain.

00:28:59.290 --> 00:29:01.820
And so I'm just going
to jump to the end here.

00:29:01.820 --> 00:29:06.500
The key concept is
that by combining

00:29:06.500 --> 00:29:09.580
SLAM with object detection we
get much better performance

00:29:09.580 --> 00:29:11.550
in object recognition.

00:29:11.550 --> 00:29:14.539
So the left shows our system.

00:29:14.539 --> 00:29:16.580
On the right is a classical
approach just looking

00:29:16.580 --> 00:29:18.242
at individual frames.

00:29:18.242 --> 00:29:19.700
And you can see,
for example, here,

00:29:19.700 --> 00:29:21.800
the red cup that's
been misclassified

00:29:21.800 --> 00:29:25.580
would get substantially better
performance by using location

00:29:25.580 --> 00:29:28.290
to cue the object
detection techniques.

00:29:28.290 --> 00:29:29.100
All right.

00:29:29.100 --> 00:29:30.110
So I'm going to wrap up.

00:29:30.110 --> 00:29:32.840
And just a little bit of
biological inspiration

00:29:32.840 --> 00:29:34.760
from our BU
collaborators: Eichenbaum

00:29:34.760 --> 00:29:37.430
has looked at the what
and the where pathways

00:29:37.430 --> 00:29:39.240
in the entorhinal cortex.

00:29:39.240 --> 00:29:42.260
And there's this duality between
location-based and object-based

00:29:42.260 --> 00:29:43.940
representations in the brain.

00:29:43.940 --> 00:29:46.140
And I think that's
very important.

00:29:46.140 --> 00:29:46.670
OK.

00:29:46.670 --> 00:29:50.834
So my dream is persistent
autonomy and lifelong map

00:29:50.834 --> 00:29:52.250
learning and making
things robust.

00:29:52.250 --> 00:29:53.666
And just for this
group I made a--

00:29:53.666 --> 00:29:55.430
I just want to
pose some questions

00:29:55.430 --> 00:29:57.290
on the biological side,
and I'll stop here.

00:29:57.290 --> 00:30:01.100
So some questions-- do
biological representations

00:30:01.100 --> 00:30:03.350
support multiple
location hypotheses?

00:30:03.350 --> 00:30:05.780
Even though we think
we know where we are,

00:30:05.780 --> 00:30:08.679
robots are faced with multimodal
situations all the time.

00:30:08.679 --> 00:30:10.220
And I wonder if
there is any evidence

00:30:10.220 --> 00:30:13.790
for multiple hypotheses in
the underlying representations

00:30:13.790 --> 00:30:17.690
in the brain, even if they don't
rise to the conscious level,

00:30:17.690 --> 00:30:20.780
and how experiences
build over time.

00:30:20.780 --> 00:30:23.850
And the question-- what are
the grid cells really doing?

00:30:23.850 --> 00:30:26.120
Are they a form of
path integration?

00:30:26.120 --> 00:30:29.060
Or there obviously, to me,
seems to be some correction.

00:30:29.060 --> 00:30:32.450
And my crazy hypothesis as
a non-brain brain scientist

00:30:32.450 --> 00:30:35.360
is, do grid cells serve as
an indexing mechanism that

00:30:35.360 --> 00:30:39.290
effectively facilitates search--
so a location-indexed search

00:30:39.290 --> 00:30:42.110
so that you can have
these pointers to what

00:30:42.110 --> 00:30:46.270
and where information
get coupled together.