The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN LEONARD: OK, thanks. Thanks for the opportunity to talk. So hi, everyone. It's a great pleasure to talk here at MBL. I've been coming to the Woods Hole Oceanographic Institution for many years, but this is my first time over here at MBL. And so I'm going to try to cover three different topics, which is probably a little ambitious on time. But there's so much I'd love to say to you. I want to talk about self-driving cars, and use that as a context to think about questions of representation for localization and mapping, and maybe connect it into some of the brain questions that you folks are interested in, and, time permitting, at the end mention a little bit of work we've done on object-based mapping in my lab.

So my background: I grew up in Philadelphia and went to UPenn for engineering. But then I went to Oxford to do my PhD at a very exciting time, when the computer vision and robotics group at Oxford was just being formed under Michael Brady. And then I came back to MIT and started working with underwater vehicles. And that's when I got involved with the Woods Hole Oceanographic Institution. And I was very fortunate to join the AI Lab back around 2002, which became part of CSAIL. And really, I've been able to work with amazing colleagues and amazing robots in a challenging set of environments.

So autonomous underwater vehicles provide a very unique challenge, because we have very poor communications to them. Typically, we use acoustic modems that might give you 96 bytes, if you're lucky, every 10 seconds, out to a few kilometers of range. And so we also need to think about the constraints of running in real time onboard a vehicle.
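To put that communications constraint in perspective, here is a quick back-of-envelope sketch. The 96 bytes per 10 seconds figure is from the talk; the image size below is just an assumed example, not anything from the actual vehicles.

```python
# Back-of-envelope arithmetic for the acoustic-modem constraint above
# (96 bytes every ~10 seconds). The image size is an assumed example.
packet_bytes = 96
packet_period_s = 10.0

bits_per_second = packet_bytes * 8 / packet_period_s
print(f"effective link rate: {bits_per_second:.0f} bits/s")   # ~77 bits/s

# Time to ship one uncompressed 640x480 8-bit image over this link:
image_bytes = 640 * 480
seconds = image_bytes / packet_bytes * packet_period_s
print(f"one 640x480 frame: {seconds / 3600:.1f} hours")       # ~8.9 hours
```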
And so, the sort of work that my lab has done: while we investigate fundamental questions about robot perception, navigation, and mapping, we're also involved in building systems. So this is a project I did for the Office of Naval Research some years ago, using small vehicles that would reacquire mine-like targets on the bottom for the Navy. And so this is an example of a more applied system, where we had a very small, resource-constrained platform. And the way it worked is that the robot built a map as it performed its mission, and then matched that map against a prior map to do terminal guidance to a target.

Another big system I was involved with, as Russ mentioned, was the Urban Challenge. And I'll say a bit about that in the context of self-driving cars.

So let's see. So who's heard any of the recent statements from Elon Musk from Tesla? He said that self-driving cars are solved. And a particular thing that he said just made my-- I don't know, maybe steam came out of my head-- he compared autonomous cars with elevators that used to require operators but are now self-service. So imagine getting in a car, pressing a button, and arriving at MIT in Cambridge 80 miles away, navigating through the Boston downtown highways and intersections. And maybe that will happen. But I think it's going to take a lot longer than folks are saying. And some of that comes from fundamental questions in intelligence and robotics.

So in a nutshell, when Musk says that self-driving is solved, I think he's wrong, as much as I admire what Tesla and SpaceX have done. And to talk about that, I think we need to be very honest as a field about our failures as well as our successes, and try to balance what you hear in the media with the reality of where I think we are.
And so I wanted to quote verbatim what Russ said about the robotics challenge, about a project that was so exhausting and just all-consuming and so stressful, yet so rewarding. So we did this in 2006 and 2007, with my wonderful colleagues Seth Teller, Jonathan How, and Emilio Frazzoli, and amazing students and postdocs. We had a very large team. And we tried to push the limit on what was possible with perception and real-time motion planning. So our vehicle built a local map from its perceptual data as it traveled, using data from laser scanners and cameras. And we didn't want to blindly follow GPS. We wanted the car to make its own decisions, because blindly following GPS waypoints was the approach in the original Grand Challenge. And so Seth Teller and his student, Albert Huang, developed a vision-based perceptual system where the car tried to detect curbs and lane markings in very challenging vision conditions. For example, looking into the sun, which you'll see in a second-- a really challenging situation for trying to perceive the world.

And so our vehicle-- at the time, we went a little crazy on the computation. We had 10 blades, each with four cores-- 40 cores-- which may not seem a lot now, but we needed 3.5 kilowatts just to power the computer at full tilt. We fully loaded the computer with a randomized motion planner and all these perception algorithms. We had a Velodyne laser scanner on the roof, about 12 other laser scanners, 5 cameras, and 15 radars, and we really pushed the envelope on algorithms. And so when faced with a choice in a DARPA challenge, if you want to win at all costs, you might simplify, or try to read the rules carefully and guess at rule simplifications. But that would have meant just sort of turning off the work of our PhD students, and we didn't want to do that. So at the end of the day, all credit to the teams that did well.
Carnegie Mellon was first, with $2 million; Stanford second, with $1 million; Virginia Tech third, with half a million dollars; and MIT fourth-- and nothing for fourth place. But it was quite an amazing experience.

And in the spirit of advertising our failures, I think I have time to show this. This used to be painful for me to watch, but now I've gotten over it. This is our--

[VIDEO PLAYBACK]

- Let's check in once again with the boss.

JOHN LEONARD: Even though we finished the race, we had a few incidents, so DARPA stopped things and then let us continue.

- --across the line.

JOHN LEONARD: Carnegie Mellon, who won the race. Why did that stop? Let's see.

- --at the end of mission two, behind Virginia Tech. Virginia Tech got a little issue. [INAUDIBLE] Here's--

JOHN LEONARD: We were trying to pass Cornell for a few minutes.

- Looks like they're stopped. And it looks like they're-- that the 79 is trying to pass and has passed the chase vehicle for Skynet, the 26 vehicle. Wow. And now he's done it. And Talos is going to pass. Very aggressive. And, whoa. Ohh. We had our first collision. Crash in turn one. Oh boy. That is, you know, that's a bold maneuver.

[END PLAYBACK]

JOHN LEONARD: So what actually happened? So it turned out Cornell were having problems with their actuators. They were sort of stopping and starting and stopping and starting. And we had some problems. It turned out we had about five bugs, and they had about five bugs, and they interacted. And here's a computer's-eye-- sort of the brain of the robot's-- view. Now, back in '07 we weren't using a lot of vision for object detection and classification. So with the laser scanner-- the Cornell vehicle's there. It has a license plate. It has tail lights.
It has a big number 26. It's in the middle of a road. We should know that's a car-- stay away from it. But to the laser scanner, it's just a blob of laser scanner data. And even when we pulled around the side of the car, we weren't clever enough with our algorithms to fill in the fact that it's a car. And when it starts moving, you have the aperture problem: as you're moving, and it's moving, it's very hard to deduce its true motion.

Now, another thing that happened was we had a threshold. And so in our 150,000 lines of code, our wonderfully gifted student Ed Olson, who's now a tenured professor at Michigan, had a threshold of 3 meters per second. So anything moving faster than 3 meters per second could be a car. Anything less than 3 meters per second couldn't be a car. Now, that might seem kind of silly. But it turns out that slowly moving obstacles are much harder to detect and classify than fast-moving obstacles. That's one reason that city driving-- or driving, say, in a shopping mall parking lot-- is actually in many ways more challenging than driving on the highway. And so despite our best efforts to stop at the last minute, we steered into the car and had this little minor fender bender. But one thing that we did is we made all our data available open source. And we actually wrote a journal article on this incident and a few others.

And so if you'd asked me then, in 2007, I would have said we're a long way from turning your car loose on the streets of Boston with absolutely no user input. And the real challenge is uncertainty and robustness, and developing robust systems that really work. But for our system, some of the algorithm progress we made-- I mentioned the lane tracking. Albert Huang, who's now, I think, working at Google-- I'd say about 10% or more of the recent graduates are working at Google these days.
AUDIENCE: Albert's at [INAUDIBLE].

JOHN LEONARD: Oh. OK.

And then here is a video from the qualifying event to get into the final race. We had to navigate-- whoops, I can't press the mouse. That's going to stop. So we had to navigate along a curved road with very sparse waypoints. And so, in real time, the computer has to make decisions about what it sees. Where is the road? Where am I? Are there obstacles? And there are no parked cars in this situation, but other stretches had parked cars. And our car-- in a nutshell, if our robot became confused about where the road was, it would stop. It would have to wait and get its courage up, lowering its thresholds as it was stuck. But we were, to our knowledge, the only team to qualify without actually adding waypoints. So it turns out the other top teams just went in with a Google satellite image and added a breadcrumb trail for the robot to follow, simplifying the perception.

So this was back in '07. Now let's fast-forward to 2015. And right now-- so of course, we have the Google self-driving car, which has just been an amazing project. And you've all probably seen these videos, each with millions of hits on YouTube. The earlier one, taking a blind person for a ride to Taco Bell-- that was 2012; then driving city streets in 2014 and spring 2015. And then there's the new Google car, which in its final instantiation won't have a steering wheel, won't have pedals. It will just have a stop button. And that's your analogy to the elevator.

And so I think that the Google car is an amazing research project that might one day transform mobility. But I do think, with all sincerity-- so I rode in the Google car last summer. I was blown away. I felt like I was on the beach at Kitty Hawk. It's this really profound technology that could in the long term have a very big impact.
And I have amazing respect for that team-- Chris Urmson, Mike Montemerlo, et cetera. But I think in the media and elsewhere the technology has been a bit overhyped, and it's poorly understood. And a lot of it comes down to how the car localizes itself, how it uses prior maps, and how they simplify the task of driving. And so even though people like Musk have said driving is a solved problem, I think we have to be aware that just because it works for Google doesn't mean it'll work for everybody else.

So, critical differences between Google and, say, everyone else-- and this is with all respect to all players. I'm not trying to criticize. It's more just trying to balance the debate. The Google car localizes, on the left, with a prior map, where they map the LiDAR intensity off of the ground surface. And they will annotate the map by hand-- adding pedestrian crossings, adding stoplights. They'll drive a car around many, many times, and then do a SLAM process to optimize the map. But if the world changes, they're going to have to adapt to that. Now, they've shown the ability to respond to construction, and to bicyclists with hand signals. When I was in the car, we crossed railroad tracks. That just blew me away. I mean, it's a pretty impressive capability.

Contrast that with a more vision-based approach that just follows the lane markings. If the lane markings are good, everything's fine. In fact, Tesla either just have released, or are about to release, their Autopilot software, which is an advanced lane-keeping system. And Elon Musk, a few weeks ago, posted on Twitter that there's one last corner case for us to fix. And apparently, on part of his commute in the Los Angeles area there are well-defined lane markings, and part of it is a concrete road with weeds and skid marks and so forth. And he said publicly that the system works well if the lane markings are well-defined, but for more challenging vision conditions, like looking into the sun, it doesn't work as well.
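To make that failure mode concrete, here is a minimal, purely illustrative sketch of a confidence-gated lane keeper. This is not Tesla's or anyone's actual logic; the LaneEstimate structure, gain, and threshold are all invented for illustration.

```python
# A schematic sketch (no vendor's real logic) of the level-3 lane-keeping
# failure mode: the controller can only act when the lane detector is
# confident, and glare or missing paint drives that confidence down.
from dataclasses import dataclass

@dataclass
class LaneEstimate:
    center_offset_m: float   # lateral offset from lane center
    confidence: float        # detector confidence in [0, 1]

MIN_CONFIDENCE = 0.8         # hypothetical gate

def steering_command(lane: LaneEstimate, k_p: float = 0.5):
    """Return a steering correction, or None to demand a human takeover."""
    if lane.confidence < MIN_CONFIDENCE:
        return None          # faded markings / sun glare: alert the driver
    return -k_p * lane.center_offset_m

print(steering_command(LaneEstimate(0.3, 0.95)))  # -0.15: steer back
print(steering_command(LaneEstimate(0.3, 0.40)))  # None: handover
```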
And so the critical difference is that if you're going to use LiDAR with prior maps, you can do very precise localization, down to less than 10 centimeters of accuracy. And the way I think about it is that robot navigation is about three things: where do you want the robot to be? Where does the robot think it is? And where really is the robot? And when the robot thinks it's somewhere, but it's really somewhere different, that's really bad. That happens. We've lost underwater vehicles and had very nervous searches to find them-- luckily-- when the robot made a mistake.

And so with the Google approach, they really nail this "where am I" problem-- the localization problem. But it means having an expensive LiDAR. It means having accurate maps. It means maintaining them.

One critical distinction is between level four and level three. These are definitions of autonomy from the US government-- from NHTSA. A level four car is what Google are trying to do now, which is really-- you could go to sleep. The car has 100% control. You couldn't intervene if you wanted to. You just press a button, go to sleep, and wake up at your destination. Musk has said that he thinks within five years you can go to sleep in your car, which to me-- five decades would impress me, to be honest.

But level three is when the car is going to do most of the job, but you have to take over if something goes wrong. And for example, Delphi drove 99% of the way across the US in spring of this year, which is pretty impressive. But 50 miles had to be driven by people-- getting on and off of highways and on city streets. And so there's something about human nature, and the way humans interact with autonomous systems, that makes it actually kind of hard for a person to pay attention. Imagine if 99% of the time the car does it perfectly, but 1% of the time it's about to make a mistake, and you have to be alert to take over.
And research experience from aviation has shown that humans are actually bad at that.

And another issue is-- and, I mean, Mountain View is pretty complicated: lots of cyclists, pedestrians, the railroad crossings I mentioned, construction. But in California they've had this historic drought. And most of the testing has been done with no rain, for example, and no snow. And if you think about Boston and Boston roads, there are some pretty challenging situations. And so for myself-- a couple of years ago I said that I didn't expect, in my lifetime, a fully autonomous taxi that could go anywhere in Manhattan. And I got criticized online for saying that.

So I put a dash cam on my car, and actually had my son record cell phone footage. The upper left is making a left turn near my house in Newton, Mass. And if you look to the right, there's cars as far as the eye can see. And if you look to the left, there's cars coming at a pretty high rate of speed, with a mailbox and a tree. And this is a really challenging behavior, even for a human, because it requires making a decision in real time. We want very high reliability in terms of detecting the cars coming from the left. But the way that I pulled out was to wave at a person in another car. And those sorts of nods and waves-- they're some of the most challenging forms of human-computer interaction. So imagine vision algorithms that could detect a person nodding at you from the other direction.

Or here's another situation. This is going through Coolidge Corner in Brookline. And I'll show a longer version of this in a second. But the light's green. And see here-- this police officer? So despite the green light, the police officer just raises their hand, and that's the signal to stop. And so interacting with crossing guards and people is very challenging, as well as changes to the road surface and, of course, adverse weather.
And so here's a longer sequence with that police officer. First of all, you'll see flashing lights on the left-- and flashing lights might mean you should pull over. Here you should just drive past them; the cop just left his lights on when he parked his car. But the light's red. And this police officer is waving me through a red light, which I think is a really advanced behavior. So imagine the logic for a car: OK, stop at red lights, unless there's a police officer waving you through-- and imagine how you get that reliable. And now we're going to pull up to the next intersection, and this police officer is going to stop us at a green light.

And so despite all the recent progress in vision-- things like image labeling, ImageNet-- most of those systems are trained with vast archives of images from the internet, where there's no context. And they're challenging for even humans to classify. So on some data sets, like the Caltech pedestrian data set, if you got 78% performance, that's really good. But we need 99.9999% or better performance before we're going to turn cars loose in the wild in these challenging situations.

Now, going back to localization and mapping. Here I collected data for about three or four weeks of my commuting. This is crossing the Mass. Ave. Bridge, going from Boston into Cambridge. And the lighting is a little tricky. But tell me what's different between the top and the bottom video. And notice, by the way, how close we come to this truck. With the slightest angular error in your position estimate, really bad things could happen.

But the top-- this was a long weekend, Veterans Day weekend. They repaved the Mass. Ave. Bridge. So on the bottom, the lane lines are gone. And so if you had an appearance-based localization algorithm like Google's, you would need to remap the bridge before you drove on it. But the lines aren't there yet.
And how well is that going to work? So this is just a really tricky situation. And, of course, there's weather. Now, snow is difficult for things like traction and control. But think about perception. If you look at how the Google car actually works-- if you're going to localize yourself based on precisely knowing the car's position down to centimeters, so that you can predict what you should see, then if you can't see the road surface, you're not going to be able to localize. And so this is just a reminder of the sorts of maps that Google uses. So I think to make it to really challenging weather and very complex environments, we need a higher-level understanding of the world-- I think more a semantic or object-based understanding of the world.

And then, of course, there are difficulties in perception. And so what do you see in this picture? The sun? There's a green light there. I realize the lighting is really harsh, and maybe you could do polarization or something better. But does anyone see the traffic cop standing there? You can just make out his legs. There's a policeman there who gave me this little wave, even though I was sort of blinded by the sun. And he walked out, put his back to me, and was waving pedestrians across, even though the light was green. So a purely vision-based system is going to need dramatic leaps in visual performance.

So to wrap up the self-driving car part, I think the big questions going forward-- the technical challenges-- are maintaining the maps, dealing with adverse weather, interacting with people both inside and outside of the car, and then getting truly robust computer vision algorithms. We want to get to a totally different place on the ROC curves, or the precision-recall curves, where we're approaching perfect detection with no false alarms. And that's a really hard thing to do.
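As a rough back-of-envelope illustration of why benchmark-level accuracy doesn't translate to deployment, consider how per-frame error rates compound over an hour of driving. The frame rate and error rates below are assumptions for illustration, not measurements.

```python
# Illustrative arithmetic only: per-frame error rates that sound small
# become frequent events at driving timescales.
frames_per_second = 10            # assumed perception rate
seconds_per_hour = 3600

for miss_rate in (0.22, 1e-3, 1e-6):   # 78% benchmark ~ 22% misses
    misses_per_hour = miss_rate * frames_per_second * seconds_per_hour
    print(f"per-frame miss rate {miss_rate:g} -> "
          f"{misses_per_hour:,g} missed detections per hour")
# 0.22 -> 7,920 per hour
# 1e-3 -> 36 per hour
# 1e-6 -> 0.036 per hour (roughly one every 28 hours)
```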
So I've worked my whole life on the robot mapping and localization problem. And for this audience, I wanted to just ask you a little question. Does anyone know what the 2014 Nobel Prize in Medicine or Physiology was for? Anybody?

AUDIENCE: [INAUDIBLE]

AUDIENCE: Grid cells.

JOHN LEONARD: Grid cells. Grid cells and place cells. And so this has been called SLAM in the brain. Now, you might argue, and we might be very far from knowing. But I think it's just really exciting. So for myself, I'll explain. I've had what's called an ONR MURI grant-- a multidisciplinary university research initiative grant-- with Mike Hasselmo and his colleagues at Boston University. And these are a couple of Mike's videos. And so, I think Matt Wilson spoke to your group. And the notion that in the entorhinal cortex there is this sort of position information that's very metrical, and that it seems to be at the heart of memory formation, to me is very powerful and very important.

And so, we have this underlying question of representation. How do we represent the world? And I believe location is just absolutely vital to building memories and to developing advanced reasoning in the world. And the fact that grid cells exist, and that they have this role in memory formation, is to me a really exciting concept.

And so, in robotics we call the problem of how a robot builds a map and uses that map to navigate SLAM-- simultaneous localization and mapping. This is from a PR2 robot being driven around the second floor of our building, not far from Patrick's office, if you recognize any of that. And this is using stereo vision. My PhD student Hordur Johannsson, who graduated a couple of years ago, created a system to do real-time SLAM and tried to address how to get temporally scalable representations. And one thing you'll see occasionally as the robot goes around is loop closing, where the robot might come back, have an error, and then correct that error.
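For a concrete picture of what loop closing does, here is a minimal sketch: a one-dimensional pose chain with drifty odometry and a single loop-closure constraint, solved jointly by least squares. The numbers are invented, and this is not the system in the video.

```python
# Minimal loop-closing sketch: joint least squares over all constraints
# spreads the accumulated odometry drift around the loop.
import numpy as np

n = 5                                   # poses x0..x4 of an out-and-back path
rows, rhs = [], []

def add_constraint(i, j, measured, weight=1.0):
    row = np.zeros(n)
    row[j] = 1.0
    if i is not None:
        row[i] = -1.0                   # encodes x_j - x_i = measured
    rows.append(weight * row)
    rhs.append(weight * measured)

add_constraint(None, 0, 0.0, weight=100.0)        # prior: anchor x0 at 0
for i, step in enumerate([1.1, 1.1, -0.9, -0.9]):
    add_constraint(i, i + 1, step)                # drifty odometry (truth: +/-1.0)
add_constraint(0, 4, 0.0, weight=10.0)            # loop closure: we're back home

A, b = np.vstack(rows), np.array(rhs)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(x, 2))   # [0. 1. 2. 1. 0.] -- drift spread around the loop
```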
So this is the part of the SLAM problem that is, in some ways, well understood in robotics: how you detect features from images, track them over time, and try to bootstrap up, building a representation and using it for your state estimation. And I've worked on this my whole career. As a grad student at Oxford, I had very primitive sensors. So for a historical SLAM talk, I recently digitized an old video and some old pictures. This was in the basement of the engineering building at Oxford. This is just the localization part: you have a map, and you generate predictions-- in this case, for sonar measurements. And at the time-- I'm sitting at a Sun workstation there. To my left is something called a Datacube, which for about $100,000 could just barely do real-time frame grabbing and then edge detection. And so vision just wasn't ready. And the exciting thing now in our field is that vision is ready-- we're really using vision in a substantial way.

But I think a lot about prediction. If you know your position, you can predict what you should see and create a feedback loop. And that's sort of what we're trying to do-- in its simplest form, something like the sketch that follows.
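Here is that predict-then-correct loop in roughly its most stripped-down form: one-dimensional localization against a known wall with a noisy range sensor, a toy stand-in for the sonar setup. All quantities are invented for illustration.

```python
# Predict what the sensor *should* read from the believed position, then
# correct the belief with the difference (a 1-D Kalman-style update).
wall_position = 10.0        # from the prior map

x, var = 2.0, 1.0           # believed position and its variance
meas_var = 0.25             # range-sensor noise variance

def update(x, var, measured_range):
    predicted_range = wall_position - x        # what we should see
    innovation = measured_range - predicted_range
    k = var / (var + meas_var)                 # Kalman gain magnitude
    # Range shrinks as x grows, so a positive innovation means we are
    # farther from the wall than believed: move x down (hence the minus).
    return x - k * innovation, (1 - k) * var

x, var = update(x, var, measured_range=7.5)    # true position is ~2.5
print(round(x, 2), round(var, 3))              # 2.4 0.2: belief corrected
```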
And so SLAM is a wonderful problem, I believe, for addressing a whole great set of questions, because there are these different axes of difficulty that interact with one another. One is representation: how do we represent the world? And on that question, I think we still have a ton of things to think about. Another is inference: we want to do real-time inference about what's where in the world, and how we combine it all together. And finally, there's a systems and autonomy axis, where we want to build systems, deploy them, and have them operate robustly and reliably in the world.

So in SLAM, here's an example of how we pose this as an inference problem. This is from the classic Victoria Park data set from the University of Sydney. A robot drives around, in this case, a park with some trees. There are landmarks, shown in green. The robot's position estimate drifts over time-- we have dead-reckoning error; that's shown in blue. And we estimate the trajectory of the robot, in red, and the positions of the landmarks, from relative measurements. So as you take relative measurements, and you move through the world, how do you put that all together? And so we cast this as an inference problem where we have the robot poses, the odometric inputs, landmarks-- you can do it with or without landmarks-- and measurements. And an interesting thing-- so we have this inference problem on a belief network. The key thing about SLAM is that it's building up over time. So you start with nothing, and the problem grows ever larger.

And, let's see, if I had to say, over 25 years of thinking about this up through 2012, the most important thing I learned is that maintaining sparsity in the underlying representation is critical. And, in fact, for biological systems I wonder if there is evidence of sparsity, because sparsity is the key to doing efficient inference when you pose this problem. And so many algorithms have basically boiled down to maintaining sparsity in the underlying representations.
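To see why sparsity shows up in this formulation, consider a toy one-dimensional pose graph: each odometry factor touches only consecutive poses, and each loop closure touches one pair, so the resulting information matrix is almost entirely zeros. A small sketch, with arbitrary sizes:

```python
# Sparsity sketch: the information matrix J^T J of a pose graph is
# mostly zeros because each factor couples only one pair of poses.
import numpy as np

n = 100                                   # poses (1-D for simplicity)
J = []

def add_factor(i, j):
    row = np.zeros(n)
    row[i], row[j] = -1.0, 1.0            # relative constraint x_j - x_i
    J.append(row)

for i in range(n - 1):
    add_factor(i, i + 1)                  # odometry chain
for i, j in [(0, 99), (10, 60), (25, 80)]:
    add_factor(i, j)                      # a few loop closures

info = np.array(J).T @ np.array(J)        # information matrix
nonzero = np.count_nonzero(info)
print(f"density: {nonzero / info.size:.1%}")   # ~3% of entries nonzero
```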
So just briefly, the most important thing I've learned since then, in the last few years: I'm really excited by building dense representations. So this is work in collaboration with some folks in Ireland-- Tom Whelan and John McDonald-- building on KinectFusion from Richard Newcombe and Andrew Davison: how you can use a GPU to build a volumetric representation, and build rich, dense models, and estimate your motion as you go through the world. So this is something we call continuous, or spatially extended, KinectFusion. This little video here, from three years ago, was taken in an apartment in Ireland. And I'll show you the end result. Just hand-carrying a sensor through the world-- and you can see the quality of the reconstructions you can build, say, in the bathroom: the sink, the tub, the stairs. To have really rich 3D models that we can build, and that then enable the more advanced interactions that Russ showed-- that's fantastic.

And I mentioned loop closing. Something we did a couple of years ago was adding loop closing to these dense representations. So this is, again, in CSAIL-- this is walking around the Stata Center with about eight minutes of data, going up and down stairs. If you watch the two blue chairs near Randy Davis's office, you can see how they get locked into place as you correct the error. So this is taking mesh deformation techniques from graphics and combining them, so that the underlying pose graph representation is like a foundation or a skeleton on which you build the rich representation.

OK. So this is the resulting map at the end. And there's been some really exciting work just this year from Whelan and from Newcombe in this space, on deformable objects, and then on really scalable algorithms where you can sort of paint the world.

So the final thing I want to talk about, in my last few minutes, is our latest work on using object-based representations. And for this audience-- I think if you go back to David Marr, who I feel is historically underappreciated, vision is the process of discovering from images what is present in the world and where it is. And to me, the what and the where are coupled. And maybe that's been lost a bit. And I think that's one way in which robotics can help with vision and the brain sciences. I think we need to develop an object-based understanding of the world-- so instead of just having representations that are a massive number of points, or purely appearance-based, we can start to build a higher-level and symbolic understanding of the world.
And so I want to build rich representations that leverage knowledge of your location to better understand where objects are, and knowledge about objects to better understand your location. And just as a step in that direction, my student Sudeep Pillai, who was one of Seth's students, has an RSS paper where we looked at using SLAM to get better object recognition. So here's an example of an input data stream, from Dieter Fox's group. There are just some objects on a table. I realize it's a relatively uncluttered scene, but this has been a benchmark for RGB-D perception. And so, if you combine data as you move through the world, using a SLAM system to do 3D reconstruction of the scene, and then use the reconstructed points to help improve the prediction process for object recognition, it leads to a more scalable system for recognizing objects.

And it comes back to this notion, to me, that a big part of perception is prediction-- the ability to predict what you see from a given location. And so what we're doing is leveraging techniques in object detection and feature encoding, and the newer SLAM algorithms-- particularly the semi-dense ORB-SLAM technique from Zaragoza, Spain. And so I'm just going to jump to the end here. The key concept is that by combining SLAM with object detection, we get much better performance on object recognition. So the left shows our system. The right is a classical approach, just looking at individual frames. And you can see, for example, here, the red cup that's been misclassified: you get substantially better performance by using location to cue the object detection techniques.
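One way to see why multiple views help, sketched in toy form below: if SLAM can associate detections of the same physical object across frames, even middling per-frame classifier scores sharpen when fused. This log-odds fusion, which assumes roughly independent per-frame errors, is only a cartoon of the idea, not the method in the paper; the object IDs and scores are invented.

```python
# Toy multi-view fusion: SLAM data association ties per-frame detections
# to one physical object; fusing their scores in log-odds space sharpens
# a mediocre per-frame classifier.
import math
from collections import defaultdict

# (object_id from SLAM data association, per-frame P(label == "cup"))
frame_scores = [(7, 0.55), (7, 0.62), (7, 0.48), (7, 0.71), (7, 0.66)]

log_odds = defaultdict(float)
for obj_id, p in frame_scores:
    log_odds[obj_id] += math.log(p / (1 - p))

for obj_id, l in log_odds.items():
    fused = 1 / (1 + math.exp(-l))
    print(f"object {obj_id}: fused P(cup) = {fused:.2f}")  # ~0.90
```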
All right. So I'm going to wrap up, with just a little bit of biological inspiration from our BU collaborators. Eichenbaum has looked at the what and the where pathways in the entorhinal cortex. And there's this duality between location-based and object-based representations in the brain. And I think that's very important.

OK. So my dream is persistent autonomy and lifelong map learning, and making things robust. And just for this group, I want to pose some questions on the biological side, and I'll stop here. So, some questions. Do biological representations support multiple location hypotheses? Even though we think we know where we are, robots are faced with multimodal situations all the time. And I wonder if there is any evidence for multiple hypotheses in the underlying representations in the brain, even if they don't rise to the conscious level, and for how experiences build over time. And the question: what are the grid cells really doing? Are they a form of path integration? Obviously, to me, there seems to be some correction as well. And my crazy hypothesis, as a non-brain scientist, is: do grid cells serve as an indexing mechanism that effectively facilitates search-- a location-indexed search, so that you can have these pointers by which the what and the where information get coupled together?