The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN LEONARD: OK, thanks. Thanks for the opportunity to talk. So hi, everyone. It's a great pleasure to talk here at MBL. I've been coming to the Woods Hole Oceanographic Institution for many years, but this is my first time over here at MBL. And so I'm going to try to cover three different topics, which is probably a little ambitious on time. But there's so much I'd love to say to you. I want to talk about self-driving cars, and use that as a context to think about questions of representation for localization and mapping, and maybe connect it into some of the brain questions that you folks are interested in, and, time permitting, at the end mention a little bit of work we've done on object-based mapping in my lab.

So my background: I grew up in Philadelphia and went to UPenn for engineering. But then I went to Oxford to do my PhD at a very exciting time, when the computer vision and robotics group at Oxford was just being formed under Michael Brady. And then I came back to MIT and started working with underwater vehicles. And that's when I got involved with the Woods Hole Oceanographic Institution. And I was very fortunate to join the AI Lab back around 2002, which became part of CSAIL. And really, I've been able to work with amazing colleagues and amazing robots in a challenging set of environments.

So autonomous underwater vehicles provide a very unique challenge, because we have very poor communications to them. Typically, we use acoustic modems that might give you 96 bytes, if you're lucky, every 10 seconds, out to a few kilometers of range. And so we also need to think about the constraints of running in real time onboard a vehicle.
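To put that communications constraint in perspective, here is a quick back-of-envelope sketch. The 96 bytes per 10 seconds figure is from the talk; the image size below is just an assumed example, not anything from the actual vehicles.

```python
# Back-of-envelope arithmetic for the acoustic-modem constraint above
# (96 bytes every ~10 seconds). The image size is an assumed example.
packet_bytes = 96
packet_period_s = 10.0

bits_per_second = packet_bytes * 8 / packet_period_s
print(f"effective link rate: {bits_per_second:.0f} bits/s")   # ~77 bits/s

# Time to ship one uncompressed 640x480 8-bit image over this link:
image_bytes = 640 * 480
seconds = image_bytes / packet_bytes * packet_period_s
print(f"one 640x480 frame: {seconds / 3600:.1f} hours")       # ~8.9 hours
```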
And so, the sort of work that my lab has done: while we investigate fundamental questions about robot perception, navigation, and mapping, we're also involved in building systems. So this is a project I did for the Office of Naval Research some years ago, using small vehicles that would reacquire mine-like targets on the bottom for the Navy. And so this is an example of a more applied system, where we had a very small, resource-constrained platform. And the way it worked is that the robot built a map as it performed its mission, and then matched that map against a prior map to do terminal guidance to a target.

Another big system I was involved with, as Russ mentioned, was the Urban Challenge. And I'll say a bit about that in the context of self-driving cars.

So let's see. So who's heard any of the recent statements from Elon Musk from Tesla? He said that self-driving cars are solved. And a particular thing that he said just made my-- I don't know, maybe steam came out of my head-- he compared autonomous cars with elevators that used to require operators but are now self-service. So imagine getting in a car, pressing a button, and arriving at MIT in Cambridge 80 miles away, navigating through the Boston downtown highways and intersections. And maybe that will happen. But I think it's going to take a lot longer than folks are saying. And some of that comes from fundamental questions in intelligence and robotics.

So in a nutshell, when Musk says that self-driving is solved, I think he's wrong, as much as I admire what Tesla and SpaceX have done. And to talk about that, I think we need to be very honest as a field about our failures as well as our successes, and try to balance what you hear in the media with the reality of where I think we are.
And so I wanted to quote verbatim what Russ said about the robotics challenge, about a project that was so exhausting and just all-consuming and so stressful, yet so rewarding. So we did this in 2006 and 2007, with my wonderful colleagues Seth Teller, Jonathan How, and Emilio Frazzoli, and amazing students and postdocs. We had a very large team. And we tried to push the limit on what was possible with perception and real-time motion planning. So our vehicle built a local map from its perceptual data as it traveled, using data from laser scanners and cameras. And we didn't want to blindly follow GPS. We wanted the car to make its own decisions, because blindly following GPS waypoints was the approach in the original Grand Challenge. And so Seth Teller and his student, Albert Huang, developed a vision-based perceptual system where the car tried to detect curbs and lane markings in very challenging vision conditions. For example, looking into the sun, which you'll see in a second-- a really challenging situation for trying to perceive the world.

And so our vehicle-- at the time, we went a little crazy on the computation. We had 10 blades, each with four cores-- 40 cores-- which may not seem a lot now, but we needed 3.5 kilowatts just to power the computer at full tilt. We fully loaded the computer with a randomized motion planner and all these perception algorithms. We had a Velodyne laser scanner on the roof, about 12 other laser scanners, 5 cameras, and 15 radars, and we really pushed the envelope on algorithms. And so when faced with a choice in a DARPA challenge, if you want to win at all costs, you might simplify, or try to read the rules carefully and guess at rule simplifications. But that would have meant just sort of turning off the work of our PhD students, and we didn't want to do that. So at the end of the day, all credit to the teams that did well.
Carnegie Mellon was first, with $2 million; Stanford second, with $1 million; Virginia Tech third, with half a million dollars; and MIT fourth-- and nothing for fourth place. But it was quite an amazing experience.

And in the spirit of advertising our failures, I think I have time to show this. This used to be painful for me to watch, but now I've gotten over it. This is our--

[VIDEO PLAYBACK]

- Let's check in once again with the boss.

JOHN LEONARD: Even though we finished the race, we had a few incidents, so DARPA stopped things and then let us continue.

- --across the line.

JOHN LEONARD: Carnegie Mellon, who won the race. Why did that stop? Let's see.

- --at the end of mission two, behind Virginia Tech. Virginia Tech got a little issue. [INAUDIBLE] Here's--

JOHN LEONARD: We were trying to pass Cornell for a few minutes.

- Looks like they're stopped. And it looks like they're-- that the 79 is trying to pass and has passed the chase vehicle for Skynet, the 26 vehicle. Wow. And now he's done it. And Talos is going to pass. Very aggressive. And, whoa. Ohh. We had our first collision. Crash in turn one. Oh boy. That is, you know, that's a bold maneuver.

[END PLAYBACK]

JOHN LEONARD: So what actually happened? So it turned out Cornell were having problems with their actuators. They were sort of stopping and starting and stopping and starting. And we had some problems. It turned out we had about five bugs, and they had about five bugs, and they interacted. And here's a computer's-eye-- sort of the brain of the robot's-- view. Now, back in '07 we weren't using a lot of vision for object detection and classification. So with the laser scanner-- the Cornell vehicle's there. It has a license plate. It has tail lights.
It has a big number 26. It's in the middle of a road. We should know that's a car-- stay away from it. But to the laser scanner, it's just a blob of laser scanner data. And even when we pulled around the side of the car, we weren't clever enough with our algorithms to fill in the fact that it's a car. And when it starts moving, you have the aperture problem: as you're moving, and it's moving, it's very hard to deduce its true motion.

Now, another thing that happened was we had a threshold. And so in our 150,000 lines of code, our wonderfully gifted student Ed Olson, who's now a tenured professor at Michigan, had a threshold of 3 meters per second. So anything moving faster than 3 meters per second could be a car. Anything less than 3 meters per second couldn't be a car. Now, that might seem kind of silly. But it turns out that slowly moving obstacles are much harder to detect and classify than fast-moving obstacles. That's one reason that city driving-- or driving, say, in a shopping mall parking lot-- is actually in many ways more challenging than driving on the highway. And so despite our best efforts to stop at the last minute, we steered into the car and had this little minor fender bender. But one thing that we did is we made all our data available open source. And we actually wrote a journal article on this incident and a few others.

And so if you'd asked me then, in 2007, I would have said we're a long way from turning your car loose on the streets of Boston with absolutely no user input. And the real challenge is uncertainty and robustness, and developing robust systems that really work. But for our system, some of the algorithm progress we made-- I mentioned the lane tracking. Albert Huang, who's now, I think, working at Google-- I'd say about 10% or more of the recent graduates are working at Google these days.
AUDIENCE: Albert's at [INAUDIBLE].

JOHN LEONARD: Oh. OK.

And then here is a video from the qualifying event to get into the final race. We had to navigate-- whoops, I can't press the mouse. That's going to stop. So we had to navigate along a curved road with very sparse waypoints. And so, in real time, the computer has to make decisions about what it sees. Where is the road? Where am I? Are there obstacles? And there are no parked cars in this situation, but other stretches had parked cars. And our car-- in a nutshell, if our robot became confused about where the road was, it would stop. It would have to wait and get its courage up, lowering its thresholds as it was stuck. But we were, to our knowledge, the only team to qualify without actually adding waypoints. So it turns out the other top teams just went in with a Google satellite image and added a breadcrumb trail for the robot to follow, simplifying the perception.

So this was back in '07. Now let's fast-forward to 2015. And right now-- so of course, we have the Google self-driving car, which has just been an amazing project. And you've all probably seen these videos, each with millions of hits on YouTube. The earlier one, taking a blind person for a ride to Taco Bell-- that was 2012; then driving city streets in 2014 and spring 2015. And then there's the new Google car, which in its final instantiation won't have a steering wheel, won't have pedals. It will just have a stop button. And that's your analogy to the elevator.

And so I think that the Google car is an amazing research project that might one day transform mobility. But I do think, with all sincerity-- so I rode in the Google car last summer. I was blown away. I felt like I was on the beach at Kitty Hawk. It's this really profound technology that could in the long term have a very big impact.
And I have amazing respect for that team-- Chris Urmson, Mike Montemerlo, et cetera. But I think in the media and elsewhere the technology has been a bit overhyped, and it's poorly understood. And a lot of it comes down to how the car localizes itself, how it uses prior maps, and how they simplify the task of driving. And so even though people like Musk have said driving is a solved problem, I think we have to be aware that just because it works for Google doesn't mean it'll work for everybody else.

So, critical differences between Google and, say, everyone else-- and this is with all respect to all players. I'm not trying to criticize. It's more just trying to balance the debate. The Google car localizes, on the left, with a prior map, where they map the LiDAR intensity off of the ground surface. And they will annotate the map by hand-- adding pedestrian crossings, adding stoplights. They'll drive a car around many, many times, and then do a SLAM process to optimize the map. But if the world changes, they're going to have to adapt to that. Now, they've shown the ability to respond to construction, and to bicyclists with hand signals. When I was in the car, we crossed railroad tracks. That just blew me away. I mean, it's a pretty impressive capability.

Contrast that with a more vision-based approach that just follows the lane markings. If the lane markings are good, everything's fine. In fact, Tesla either just have released, or are about to release, their Autopilot software, which is an advanced lane-keeping system. And Elon Musk, a few weeks ago, posted on Twitter that there's one last corner case for us to fix. And apparently, on part of his commute in the Los Angeles area there are well-defined lane markings, and part of it is a concrete road with weeds and skid marks and so forth. And he said publicly that the system works well if the lane markings are well-defined, but for more challenging vision conditions, like looking into the sun, it doesn't work as well.
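To make that failure mode concrete, here is a minimal, purely illustrative sketch of a confidence-gated lane keeper. This is not Tesla's or anyone's actual logic; the LaneEstimate structure, gain, and threshold are all invented for illustration.

```python
# A schematic sketch (no vendor's real logic) of the level-3 lane-keeping
# failure mode: the controller can only act when the lane detector is
# confident, and glare or missing paint drives that confidence down.
from dataclasses import dataclass

@dataclass
class LaneEstimate:
    center_offset_m: float   # lateral offset from lane center
    confidence: float        # detector confidence in [0, 1]

MIN_CONFIDENCE = 0.8         # hypothetical gate

def steering_command(lane: LaneEstimate, k_p: float = 0.5):
    """Return a steering correction, or None to demand a human takeover."""
    if lane.confidence < MIN_CONFIDENCE:
        return None          # faded markings / sun glare: alert the driver
    return -k_p * lane.center_offset_m

print(steering_command(LaneEstimate(0.3, 0.95)))  # -0.15: steer back
print(steering_command(LaneEstimate(0.3, 0.40)))  # None: handover
```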
And so the critical difference is that if you're going to use LiDAR with prior maps, you can do very precise localization, down to less than 10 centimeters of accuracy. And the way I think about it is that robot navigation is about three things: where do you want the robot to be? Where does the robot think it is? And where really is the robot? And when the robot thinks it's somewhere, but it's really somewhere different, that's really bad. That happens. We've lost underwater vehicles and had very nervous searches to find them-- luckily-- when the robot made a mistake.

And so with the Google approach, they really nail this "where am I" problem-- the localization problem. But it means having an expensive LiDAR. It means having accurate maps. It means maintaining them.

One critical distinction is between level four and level three. These are definitions of autonomy from the US government-- from NHTSA. A level four car is what Google are trying to do now, which is really-- you could go to sleep. The car has 100% control. You couldn't intervene if you wanted to. You just press a button, go to sleep, and wake up at your destination. Musk has said that he thinks within five years you can go to sleep in your car, which to me-- five decades would impress me, to be honest.

But level three is when the car is going to do most of the job, but you have to take over if something goes wrong. And for example, Delphi drove 99% of the way across the US in spring of this year, which is pretty impressive. But 50 miles had to be driven by people-- getting on and off of highways and on city streets. And so there's something about human nature, and the way humans interact with autonomous systems, that makes it actually kind of hard for a person to pay attention. Imagine if 99% of the time the car does it perfectly, but 1% of the time it's about to make a mistake, and you have to be alert to take over.
And research experience from aviation has shown that humans are actually bad at that.

And another issue is-- and, I mean, Mountain View is pretty complicated: lots of cyclists, pedestrians, the railroad crossings I mentioned, construction. But in California they've had this historic drought. And most of the testing has been done with no rain, for example, and no snow. And if you think about Boston and Boston roads, there are some pretty challenging situations. And so for myself-- a couple of years ago I said that I didn't expect, in my lifetime, a fully autonomous taxi that could go anywhere in Manhattan. And I got criticized online for saying that.

So I put a dash cam on my car, and actually had my son record cell phone footage. The upper left is making a left turn near my house in Newton, Mass. And if you look to the right, there's cars as far as the eye can see. And if you look to the left, there's cars coming at a pretty high rate of speed, with a mailbox and a tree. And this is a really challenging behavior, even for a human, because it requires making a decision in real time. We want very high reliability in terms of detecting the cars coming from the left. But the way that I pulled out was to wave at a person in another car. And those sorts of nods and waves-- they're some of the most challenging forms of human-computer interaction. So imagine vision algorithms that could detect a person nodding at you from the other direction.

Or here's another situation. This is going through Coolidge Corner in Brookline. And I'll show a longer version of this in a second. But the light's green. And see here-- this police officer? So despite the green light, the police officer just raises their hand, and that's the signal to stop. And so interacting with crossing guards and people is very challenging, as well as changes to the road surface and, of course, adverse weather.
And so here's a longer sequence with that police officer. First of all, you'll see flashing lights on the left-- and flashing lights might mean you should pull over. Here you should just drive past them; the cop just left his lights on when he parked his car. But the light's red. And this police officer is waving me through a red light, which I think is a really advanced behavior. So imagine the logic for a car: OK, stop at red lights, unless there's a police officer waving you through-- and imagine how you get that reliable. And now we're going to pull up to the next intersection, and this police officer is going to stop us at a green light.

And so despite all the recent progress in vision-- things like image labeling, ImageNet-- most of those systems are trained with vast archives of images from the internet, where there's no context. And they're challenging for even humans to classify. So on some data sets, like the Caltech pedestrian data set, if you got 78% performance, that's really good. But we need 99.9999% or better performance before we're going to turn cars loose in the wild in these challenging situations.

Now, going back to localization and mapping. Here I collected data for about three or four weeks of my commuting. This is crossing the Mass. Ave. Bridge, going from Boston into Cambridge. And the lighting is a little tricky. But tell me what's different between the top and the bottom video. And notice, by the way, how close we come to this truck. With the slightest angular error in your position estimate, really bad things could happen.

But the top-- this was a long weekend, Veterans Day weekend. They repaved the Mass. Ave. Bridge. So on the bottom, the lane lines are gone. And so if you had an appearance-based localization algorithm like Google's, you would need to remap the bridge before you drove on it. But the lines aren't there yet.
And how well is that going to work? So this is just a really tricky situation. And, of course, there's weather. Now, snow is difficult for things like traction and control. But think about perception. If you look at how the Google car actually works-- if you're going to localize yourself based on precisely knowing the car's position down to centimeters, so that you can predict what you should see, then if you can't see the road surface, you're not going to be able to localize. And so this is just a reminder of the sorts of maps that Google uses. So I think to make it to really challenging weather and very complex environments, we need a higher-level understanding of the world-- I think more a semantic or object-based understanding of the world.

And then, of course, there are difficulties in perception. And so what do you see in this picture? The sun? There's a green light there. I realize the lighting is really harsh, and maybe you could do polarization or something better. But does anyone see the traffic cop standing there? You can just make out his legs. There's a policeman there who gave me this little wave, even though I was sort of blinded by the sun. And he walked out, put his back to me, and was waving pedestrians across, even though the light was green. So a purely vision-based system is going to need dramatic leaps in visual performance.

So to wrap up the self-driving car part, I think the big questions going forward-- the technical challenges-- are maintaining the maps, dealing with adverse weather, interacting with people both inside and outside of the car, and then getting truly robust computer vision algorithms. We want to get to a totally different place on the ROC curves, or the precision-recall curves, where we're approaching perfect detection with no false alarms. And that's a really hard thing to do.
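As a rough back-of-envelope illustration of why benchmark-level accuracy doesn't translate to deployment, consider how per-frame error rates compound over an hour of driving. The frame rate and error rates below are assumptions for illustration, not measurements.

```python
# Illustrative arithmetic only: per-frame error rates that sound small
# become frequent events at driving timescales.
frames_per_second = 10            # assumed perception rate
seconds_per_hour = 3600

for miss_rate in (0.22, 1e-3, 1e-6):   # 78% benchmark ~ 22% misses
    misses_per_hour = miss_rate * frames_per_second * seconds_per_hour
    print(f"per-frame miss rate {miss_rate:g} -> "
          f"{misses_per_hour:,g} missed detections per hour")
# 0.22 -> 7,920 per hour
# 1e-3 -> 36 per hour
# 1e-6 -> 0.036 per hour (roughly one every 28 hours)
```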
So I've worked my whole life on the robot mapping and localization problem. And for this audience, I wanted to just ask you a little question. Does anyone know what the 2014 Nobel Prize in Medicine or Physiology was for? Anybody?

AUDIENCE: [INAUDIBLE]

AUDIENCE: Grid cells.

JOHN LEONARD: Grid cells. Grid cells and place cells. And so this has been called SLAM in the brain. Now, you might argue, and we might be very far from knowing. But I think it's just really exciting. So for myself, I'll explain. I've had what's called an ONR MURI grant-- a multidisciplinary university research initiative grant-- with Mike Hasselmo and his colleagues at Boston University. And these are a couple of Mike's videos. And so, I think Matt Wilson spoke to your group. And the notion that in the entorhinal cortex there is this sort of position information that's very metrical, and that it seems to be at the heart of memory formation, to me is very powerful and very important.

And so, we have this underlying question of representation. How do we represent the world? And I believe location is just absolutely vital to building memories and to developing advanced reasoning in the world. And the fact that grid cells exist, and that they have this role in memory formation, is to me a really exciting concept.

And so, in robotics we call the problem of how a robot builds a map and uses that map to navigate SLAM-- simultaneous localization and mapping. This is from a PR2 robot being driven around the second floor of our building, not far from Patrick's office, if you recognize any of that. And this is using stereo vision. My PhD student Hordur Johannsson, who graduated a couple of years ago, created a system to do real-time SLAM and tried to address how to get temporally scalable representations. And one thing you'll see occasionally as the robot goes around is loop closing, where the robot might come back, have an error, and then correct that error.
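For a concrete picture of what loop closing does, here is a minimal sketch: a one-dimensional pose chain with drifty odometry and a single loop-closure constraint, solved jointly by least squares. The numbers are invented, and this is not the system in the video.

```python
# Minimal loop-closing sketch: joint least squares over all constraints
# spreads the accumulated odometry drift around the loop.
import numpy as np

n = 5                                   # poses x0..x4 of an out-and-back path
rows, rhs = [], []

def add_constraint(i, j, measured, weight=1.0):
    row = np.zeros(n)
    row[j] = 1.0
    if i is not None:
        row[i] = -1.0                   # encodes x_j - x_i = measured
    rows.append(weight * row)
    rhs.append(weight * measured)

add_constraint(None, 0, 0.0, weight=100.0)        # prior: anchor x0 at 0
for i, step in enumerate([1.1, 1.1, -0.9, -0.9]):
    add_constraint(i, i + 1, step)                # drifty odometry (truth: +/-1.0)
add_constraint(0, 4, 0.0, weight=10.0)            # loop closure: we're back home

A, b = np.vstack(rows), np.array(rhs)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(x, 2))   # [0. 1. 2. 1. 0.] -- drift spread around the loop
```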
So this is the part of the SLAM problem that is, in some ways, well understood in robotics: how you detect features from images, track them over time, and try to bootstrap up, building a representation and using it for your state estimation. And I've worked on this my whole career. As a grad student at Oxford, I had very primitive sensors. So for a historical SLAM talk, I recently digitized an old video and some old pictures. This was in the basement of the engineering building at Oxford. This is just the localization part: you have a map, and you generate predictions-- in this case, for sonar measurements. And at the time-- I'm sitting at a Sun workstation there. To my left is something called a Datacube, which for about $100,000 could just barely do real-time frame grabbing and then edge detection. And so vision just wasn't ready. And the exciting thing now in our field is that vision is ready-- we're really using vision in a substantial way.

But I think a lot about prediction. If you know your position, you can predict what you should see and create a feedback loop. And that's sort of what we're trying to do-- in its simplest form, something like the sketch that follows.
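Here is that predict-then-correct loop in roughly its most stripped-down form: one-dimensional localization against a known wall with a noisy range sensor, a toy stand-in for the sonar setup. All quantities are invented for illustration.

```python
# Predict what the sensor *should* read from the believed position, then
# correct the belief with the difference (a 1-D Kalman-style update).
wall_position = 10.0        # from the prior map

x, var = 2.0, 1.0           # believed position and its variance
meas_var = 0.25             # range-sensor noise variance

def update(x, var, measured_range):
    predicted_range = wall_position - x        # what we should see
    innovation = measured_range - predicted_range
    k = var / (var + meas_var)                 # Kalman gain magnitude
    # Range shrinks as x grows, so a positive innovation means we are
    # farther from the wall than believed: move x down (hence the minus).
    return x - k * innovation, (1 - k) * var

x, var = update(x, var, measured_range=7.5)    # true position is ~2.5
print(round(x, 2), round(var, 3))              # 2.4 0.2: belief corrected
```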
And so SLAM is a wonderful problem, I believe, for addressing a whole great set of questions, because there are these different axes of difficulty that interact with one another. One is representation: how do we represent the world? And on that question, I think we still have a ton of things to think about. Another is inference: we want to do real-time inference about what's where in the world, and how we combine it all together. And finally, there's a systems and autonomy axis, where we want to build systems, deploy them, and have them operate robustly and reliably in the world.

So in SLAM, here's an example of how we pose this as an inference problem. This is from the classic Victoria Park data set from the University of Sydney. A robot drives around, in this case, a park with some trees. There are landmarks, shown in green. The robot's position estimate drifts over time-- we have dead-reckoning error; that's shown in blue. And we estimate the trajectory of the robot, in red, and the positions of the landmarks, from relative measurements. So as you take relative measurements, and you move through the world, how do you put that all together? And so we cast this as an inference problem where we have the robot poses, the odometric inputs, landmarks-- you can do it with or without landmarks-- and measurements. And an interesting thing-- so we have this inference problem on a belief network. The key thing about SLAM is that it's building up over time. So you start with nothing, and the problem grows ever larger.

And, let's see, if I had to say, over 25 years of thinking about this up through 2012, the most important thing I learned is that maintaining sparsity in the underlying representation is critical. And, in fact, for biological systems I wonder if there is evidence of sparsity, because sparsity is the key to doing efficient inference when you pose this problem. And so many algorithms have basically boiled down to maintaining sparsity in the underlying representations.
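To see why sparsity shows up in this formulation, consider a toy one-dimensional pose graph: each odometry factor touches only consecutive poses, and each loop closure touches one pair, so the resulting information matrix is almost entirely zeros. A small sketch, with arbitrary sizes:

```python
# Sparsity sketch: the information matrix J^T J of a pose graph is
# mostly zeros because each factor couples only one pair of poses.
import numpy as np

n = 100                                   # poses (1-D for simplicity)
J = []

def add_factor(i, j):
    row = np.zeros(n)
    row[i], row[j] = -1.0, 1.0            # relative constraint x_j - x_i
    J.append(row)

for i in range(n - 1):
    add_factor(i, i + 1)                  # odometry chain
for i, j in [(0, 99), (10, 60), (25, 80)]:
    add_factor(i, j)                      # a few loop closures

info = np.array(J).T @ np.array(J)        # information matrix
nonzero = np.count_nonzero(info)
print(f"density: {nonzero / info.size:.1%}")   # ~3% of entries nonzero
```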
So just briefly, the most important thing I've learned since then, in the last few years: I'm really excited by building dense representations. So this is work in collaboration with some folks in Ireland-- Tom Whelan and John McDonald-- building on KinectFusion from Richard Newcombe and Andrew Davison: how you can use a GPU to build a volumetric representation, and build rich, dense models, and estimate your motion as you go through the world. So this is something we call continuous, or spatially extended, KinectFusion. This little video here, from three years ago, was taken in an apartment in Ireland. And I'll show you the end result. Just hand-carrying a sensor through the world-- and you can see the quality of the reconstructions you can build, say, in the bathroom: the sink, the tub, the stairs. To have really rich 3D models that we can build, and that then enable the more advanced interactions that Russ showed-- that's fantastic.

And I mentioned loop closing. Something we did a couple of years ago was adding loop closing to these dense representations. So this is, again, in CSAIL-- this is walking around the Stata Center with about eight minutes of data, going up and down stairs. If you watch the two blue chairs near Randy Davis's office, you can see how they get locked into place as you correct the error. So this is taking mesh deformation techniques from graphics and combining them, so that the underlying pose graph representation is like a foundation or a skeleton on which you build the rich representation.

OK. So this is the resulting map at the end. And there's been some really exciting work just this year from Whelan and from Newcombe in this space, on deformable objects, and then on really scalable algorithms where you can sort of paint the world.

So the final thing I want to talk about, in my last few minutes, is our latest work on using object-based representations. And for this audience-- I think if you go back to David Marr, who I feel is historically underappreciated, vision is the process of discovering from images what is present in the world and where it is. And to me, the what and the where are coupled. And maybe that's been lost a bit. And I think that's one way in which robotics can help with vision and the brain sciences. I think we need to develop an object-based understanding of the world-- so instead of just having representations that are a massive number of points, or purely appearance-based, we can start to build a higher-level and symbolic understanding of the world.
And so I want to build rich representations that leverage knowledge of your location to better understand where objects are, and knowledge about objects to better understand your location. And just as a step in that direction, my student Sudeep Pillai, who was one of Seth's students, has an RSS paper where we looked at using SLAM to get better object recognition. So here's an example of an input data stream, from Dieter Fox's group. There are just some objects on a table. I realize it's a relatively uncluttered scene, but this has been a benchmark for RGB-D perception. And so, if you combine data as you move through the world, using a SLAM system to do 3D reconstruction of the scene, and then use the reconstructed points to help improve the prediction process for object recognition, it leads to a more scalable system for recognizing objects.

And it comes back to this notion, to me, that a big part of perception is prediction-- the ability to predict what you see from a given location. And so what we're doing is leveraging techniques in object detection and feature encoding, and the newer SLAM algorithms-- particularly the semi-dense ORB-SLAM technique from Zaragoza, Spain. And so I'm just going to jump to the end here. The key concept is that by combining SLAM with object detection, we get much better performance on object recognition. So the left shows our system. The right is a classical approach, just looking at individual frames. And you can see, for example, here, the red cup that's been misclassified: you get substantially better performance by using location to cue the object detection techniques.
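One way to see why multiple views help, sketched in toy form below: if SLAM can associate detections of the same physical object across frames, even middling per-frame classifier scores sharpen when fused. This log-odds fusion, which assumes roughly independent per-frame errors, is only a cartoon of the idea, not the method in the paper; the object IDs and scores are invented.

```python
# Toy multi-view fusion: SLAM data association ties per-frame detections
# to one physical object; fusing their scores in log-odds space sharpens
# a mediocre per-frame classifier.
import math
from collections import defaultdict

# (object_id from SLAM data association, per-frame P(label == "cup"))
frame_scores = [(7, 0.55), (7, 0.62), (7, 0.48), (7, 0.71), (7, 0.66)]

log_odds = defaultdict(float)
for obj_id, p in frame_scores:
    log_odds[obj_id] += math.log(p / (1 - p))

for obj_id, l in log_odds.items():
    fused = 1 / (1 + math.exp(-l))
    print(f"object {obj_id}: fused P(cup) = {fused:.2f}")  # ~0.90
```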
All right. So I'm going to wrap up, with just a little bit of biological inspiration from our BU collaborators. Eichenbaum has looked at the what and the where pathways in the entorhinal cortex. And there's this duality between location-based and object-based representations in the brain. And I think that's very important.

OK. So my dream is persistent autonomy and lifelong map learning, and making things robust. And just for this group, I want to pose some questions on the biological side, and I'll stop here. So, some questions. Do biological representations support multiple location hypotheses? Even though we think we know where we are, robots are faced with multimodal situations all the time. And I wonder if there is any evidence for multiple hypotheses in the underlying representations in the brain, even if they don't rise to the conscious level, and for how experiences build over time. And the question: what are the grid cells really doing? Are they a form of path integration? Obviously, to me, there seems to be some correction as well. And my crazy hypothesis, as a non-brain scientist, is: do grid cells serve as an indexing mechanism that effectively facilitates search-- a location-indexed search, so that you can have these pointers by which the what and the where information get coupled together?