The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

STEFANIE TELLEX: So today I'm going to talk about human-robot collaboration. How can we make robots that can work together with people just as if they were another person, and achieve the kind of fluid dynamic that people have when they work together? These are my human collaborators; this work was done with a lot of collaborating students and postdocs.

We're really in an exciting time in robotics, because robots are becoming more and more capable and they're able to operate in unstructured environments. Russ gave a great talk about Atlas doing things like driving a car and opening doors. This is another robot that I've worked with: a robotic forklift that can drive around autonomously in warehouse environments. It can detect where pallets are, track people, pick things up, put things down. And it's designed to do this in collaboration with people who also share the environment.

There are robots that can assemble IKEA furniture. This was work I did with Ross Knepper and Daniela Rus at MIT. They made a team of robots that can autonomously assemble tables and chairs produced by IKEA. And what would be nice is if people could work with these robots. Sometimes the robots encounter failures, and the person might be able to intervene in a way that enables the robot to recover.

And the dream is robots that operate in household environments. This is my son, about nine months old when we shot this picture with a PR2. You'd really like to imagine a robot, Rosie the Robot from The Jetsons, that lives in your house with you and helps you in all kinds of ways.
Anything from doing the laundry to cleaning up your room, emptying the dishwasher, helping you cook. And this could have applications for people in all aspects of life: elders, people who are disabled, or even people who are really busy and don't feel like doing all the chores in their house.

So the aim of my research program is to enable humans and robots to collaborate on complex tasks. And I'm going to talk about the three big problems that I think we need to solve to make this happen.

The first problem is that you need a robot that can robustly perform actions in real-world environments. We're seeing more and more progress in this area, but the house is kind of a grand challenge, and John was talking about all these edge cases. So I'm going to talk about an approach we're taking to try to increase the robustness and also the diversity of actions that a robot can take in a real-world environment, by taking an instance-based approach.

Next, you need robots that can carry out complex sequences of actions. They need to be able to plan in really, really large combinatorial state-action spaces. There might be hundreds or thousands of objects in a home that a robot might need to manipulate, and depending on whether the person is doing laundry, cooking broccoli, or making dessert, the set of objects that are relevant and useful, that the robot needs to worry about in order to help the person, is wildly different. So we need new algorithms for planning in this really large state-action space.

And finally, the robot needs to be able to figure out what people want in the first place. People communicate using language and gesture, but also just by walking around the environment and doing things from which you can infer something about their intentions. And critically, when people communicate with other people, it's not an open-loop kind of communication.
It's not like you send a message and then close your eyes and hope for the best. When you're talking with other people, you engage in a closed-loop dialogue. There's feedback going on in both directions that acts to detect and reduce errors in the communication. And this is a critical thing for robots to exploit, because robots have a lot more problems than people do in terms of perceiving and acting in the environment. So it's really important that we establish some kind of feedback loop between the human and the robot, so that the robot can infer what the person wants and carry out helpful actions. The three parts of the talk are going to be about each of these three things.

So this is my dad's pantry at home. And it's kind of like John's pictures of the Google car: most robots can't pick up most objects most of the time. It's really hard to imagine a robot doing anything with a scene like this one. There was just the Amazon Picking Challenge, and the team that won used a vacuum cleaner, not a gripper, to pick up the objects. They literally sucked the objects up into the gripper and then turned the vacuum cleaner off to put things down. And the Amazon challenge had much, much sparser stuff on the shelves. We'd really like to be able to do things like this.

For what we're doing now, I'm really going to focus on a sub-problem, which is object delivery. From my perspective, a really important baseline capability for a manipulator robot is to be able to pick something up and move it somewhere else. We'd obviously love a lot more than that; we were also talking in the car about buttoning shirts, and you can go on with all the things you might want your robot to do. But at least we'd like to be able to do pick and place: pick it up and put it down. So maybe you're in a factory delivering tools, or maybe you're in the kitchen delivering ingredients or cooking utensils.
So to do pick and place in response to natural language commands-- let's say, "hand me the knife" or something-- you need to know a few things about the object. First of all, you need to know what it is. If they said, "hand me the ruler," you need to know whether or not this object is a ruler: some kind of label that can hook up to some kind of language model. Second, you have to know where the object is in the world, because you're going to actually move your gripper and the object and yourself through 3D space in order to reach it. Here I'm highlighting the pixels of the object, but you have to register those pixels into some kind of coordinate system that lets you move your gripper over to that object. And third, you have to know where on that object you're going to put your gripper. In the case of this ruler, it's pretty heavy, and it's this funny shape that doesn't have very good friction, so for our robot the best place to pick it up is in the middle of the object. There might be more than one good place, and it might depend on the gripper. And different objects might have complex things going on that change where the right place is to pick them up.

Conventional approaches to this problem fall into two general categories. The first high-level approach is what I'm going to call category-based grasping. This is the dream: you walk up to your robot, you hand it an object that it's never seen before, and the robot infers all three of those things-- what it is, where it is, and where to put the gripper. There's a line of work that does this; this is one paper from Ashutosh Saxena, and there are a bunch of others. The problem is that it doesn't work well enough. We are not at the accuracy rates that John was alluding to that we need for driving. In Ashutosh's paper, I think they got a 70% or 80% pick success rate on their particular test set doing category-based grasping.
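As an aside, here is a minimal sketch of how those three pieces of information might be bundled per object: a label that language can hook into, a pose registered in the robot's coordinate frame, and a list of candidate grasp points. The class and field names are illustrative assumptions, not the representation the actual system uses.

```python
from dataclasses import dataclass, field

@dataclass
class GraspCandidate:
    position: tuple          # (x, y, z) on the object, in the robot's base frame
    success_estimate: float  # how reliable this grasp has been so far

@dataclass
class ObjectModel:
    label: str        # e.g. "ruler" -- the handle that language can refer to
    pose: tuple       # (x, y, z, roll, pitch, yaw) registered in the robot's frame
    grasps: list = field(default_factory=list)  # candidate grasp points

    def best_grasp(self) -> GraspCandidate:
        """Return the grasp point currently believed to be most reliable."""
        return max(self.grasps, key=lambda g: g.success_estimate)
```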
And I think you have to expect that success rate to fall if you actually give it a wider array of objects in the home. And even if it doesn't fall, 80% means it's dropping things 20% of the time, and that's not so good.

The second approach is instance-based grasping. I was talking to Eric Sudderth in my department, who does machine learning, and he said instance recognition is a solved problem in computer vision. Instance recognition is: I give you a training set of this slide flipper, lots of images of it, and then your job, given a new picture, is to draw a little box around the slide flipper. This is considered a solved problem in computer vision; there's a data set and a corpus, the performance maxed out, and people have stopped working on it. And a lot of the work in robotics uses this kind of approach. We were talking about how you have some kind of geometric model-- that's the instance-based model. These models can take a lot of different forms; they can be an image or a 3D model or whatever. The problem is, where do you get that model? If I'm in my house and there are thousands of different objects, you're most likely not going to have the 3D model for the object that you want to pick up right now for the person. So there's this sort of data gap. But if you do have the model, it can be really, really accurate, because you can know a lot about the object that you're trying to pick up.

So the contribution of our approach is to try to bridge these two by enabling a robot to get the accuracy of the instance-based approach by autonomously collecting the data that it needs in order to robustly manipulate objects. We're going to get the accuracy of the instance-based approach and the generality of the category-based approach, at the cost of not human time but robot time to build the model.

Here's what it looks like on our Baxter. It's going to build a point cloud. It's got a one-pixel Kinect in its gripper, so you're seeing it do a sort of raster scan.
The video is sped up; that scan gives us a point cloud. Now it's taking images of the object. It's got an RGB camera in its wrist, and it's taking pictures of the object from lots of different perspectives. So the data looks like this: you segment out the object from the background, and you get lots and lots of images. You do completely standard computer vision stuff, SIFT and kNN, to make a detector out of this data. You also get a point cloud; this is what the point cloud looks like at one-centimeter resolution.

And after we do this, we're able to pick up lots of stuff. This is showing our robot with these two objects, localizing the object and picking things up. It's going to pick up the egg, and that's a practice EpiPen. There's a little shake to make sure it's got a good grasp.

Now, this works on a lot of objects, so let's see how it does on the ruler. The way the system works is it uses the point cloud to infer where to grasp the object, but we don't really have a model of physics or friction or slippage. So it infers a grasp near the end, because that fits in the gripper and it kind of looks like it's going to work. And it does fit in the gripper, but when we do that shake, the ruler is going to pop right out, because it's got relatively low friction. There it goes and falls out. That's bad, right? We don't really like it when our robots drop things. So before training, what happens is it falls out of the gripper. In the case of the ruler, there's physics going on: things are slipping out, and maybe we should be doing physical reasoning. I think we should be doing physical reasoning-- I won't say "maybe" about that-- but we're not doing it right now. And there are lots of reasons things can fail.
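As a side note, here is a minimal sketch of the kind of completely standard instance detector mentioned above, built from SIFT features over the scanned views and a kNN match against a new scene. It assumes OpenCV and is illustrative only, not the actual Baxter pipeline; the thresholds are placeholder values.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def build_model(scan_images):
    """Stack SIFT descriptors from every segmented view of the scanned object."""
    descriptors = []
    for img in scan_images:
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des)
    return np.vstack(descriptors)

def contains_instance(model_descriptors, scene_image, ratio=0.75, min_matches=15):
    """kNN-match scene features against the object's model (Lowe's ratio test)."""
    _, des = sift.detectAndCompute(scene_image, None)
    if des is None:
        return False
    matches = cv2.BFMatcher().knnMatch(des, model_descriptors, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches
```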
Other objects are problematic for different reasons. This is one of those salt shakers. It's got black handles that are great for our robot to pick up, but they're black, so they absorb the IR light, so we can't see them, and we can't figure out that we're supposed to grab there. That round bulb looks awesome, but it's transparent, so you get all these weird reflections. So our inference algorithm says, oh, that bulb, that's where we should pick it up. But it doesn't fit in the gripper, so it will very often slip out.

Our approach to solving this problem is to let the robot practice. I'm not going to go through the algorithm, but we have a multi-armed bandit algorithm that lets us systematically decide where we should pick objects up. You can give it a prior on where you think good grasps are, and you can use whatever information you want in that prior. If the prior were perfect, this would be boring: it would just work the first time, and life would go on. But if the prior is wrong for any reason, the robot will be able to detect it, fix things up, and learn where the most reliable places are to pick up those objects.

Here's an example of what happens when we use this algorithm. We practice picking up the ruler; I forget how many picks it took, maybe 20 on this particular object. One twist in the algorithm is that it decides when to stop. We go a maximum of 50 picks, but we might stop after three if all three of them work, so that you can go on to the next object to train. So here it picks it up in the middle and does a nice shake.

OK, so what we're doing now is scaling this whole thing up. This is showing our robot practicing on lots and lots of different objects. A lot of them are toys; my son likes to watch this video because he likes to see the robot playing with all of his toys. And I think playing-- I mean, it's one of those loaded cognitive science words, but I think that's an interesting way to think about what the robots are actually doing right now.
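A minimal sketch of that kind of bandit loop: each candidate grasp point is treated as an arm with a Beta posterior over pick success, the robot samples an arm (Thompson sampling), attempts the pick, updates, and stops early once a grasp has worked several times in a row. The stopping rule, the prior counts, and the try_pick interface are illustrative assumptions, not the algorithm from the actual system.

```python
import random

def practice_grasps(grasp_candidates, try_pick, max_picks=50, stop_after=3):
    """Thompson-sampling bandit over candidate grasp points (illustrative sketch).

    grasp_candidates: hashable grasp poses (e.g. tuples).
    try_pick(grasp) -> bool: the robot physically attempts the pick (hypothetical hook).
    """
    stats = {g: [1.0, 1.0] for g in grasp_candidates}   # Beta(successes+1, failures+1)
    streak = {g: 0 for g in grasp_candidates}           # consecutive successes per grasp

    for _ in range(max_picks):
        # Sample a plausible success rate for each arm; try the most promising one.
        g = max(grasp_candidates, key=lambda c: random.betavariate(*stats[c]))
        if try_pick(g):
            stats[g][0] += 1
            streak[g] += 1
            if streak[g] >= stop_after:  # grasp looks reliable; move to the next object
                break
        else:
            stats[g][1] += 1
            streak[g] = 0

    # Report the grasp with the highest posterior mean success rate.
    return max(grasp_candidates, key=lambda c: stats[c][0] / sum(stats[c]))
```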
It's doing little experiments: trying to pick up these objects in different places and recording where it works and where it doesn't. This is showing 16 of them, 32 objects counting one in each hand, being run in our initial evaluation. And at the end of this, basically, it works.

This is all the objects in our test set. Before learning, with the proposal system, which uses the depth information, we get about a 50% pick success rate. After learning, that goes up to 75%. And the other really cool thing is that this is a bimodal distribution; it's not that every object succeeds 75% of the time. The chart goes from worst to best, so the good stuff is all at one end and the hard stuff is all at the other. A lot of these objects worked eight or nine out of 10 times, or 10 out of 10 times. A lot of other objects were really hard. That garlic press, I think we picked it up one time; it's really, really heavy, so it slips out a lot. That gyro-ball thing has a lot of reflections, so we had trouble localizing it accurately, and we picked it up very few times. I think everything from about the EpiPen over was eight out of 10 or better. So not only are there a lot of objects that we can pick up, we also know which ones we can pick up and which ones we can't.

We are right now taking an aggressively instance-based approach, and the reason we're doing that is that I think there's something magic when the robot actually picks something up. So where I wanted to start is: let's cheat in every way we can. Let's make a model that's totally specific to this particular object. But the next step is to scale this whole thing up and then start to think about more general models, to go back to that dream of category-based recognition.

If you look at computer vision success stories, one of the things that makes a lot of algorithms successful is data sets, and the size of those data sets is immense.
A lot of the computer vision data sets, like the COCO database from Microsoft, have millions of images labeled with where the objects are. But most of those images were taken by a human photographer on a cell phone or uploaded to Flickr, wherever they got them from. And you get to see each object once, maybe twice, from one perspective that a human carefully chose. You don't get to play with it; you don't get to manipulate it. In robotics, there are some data sets of object instances, and the largest ones have a few hundred objects. Computer vision people that I've talked to laugh at that, because it's just so much smaller than the data sets they're working with. I think it's also so much smaller than what a human child gets to play with over the course of going from zero to two years old. I guess my son became a mobile manipulator around a year, or a year and a half or so; I'm not sure exactly when.

So one of my goals is to scale this whole thing up, to change this data equation to be more in our favor. There are about 300 of these-- this is the Baxter robot. Rod Brooks, who we were talking about in the previous talk, founded the company Rethink Robotics, and they've sold about 300 of these to the robotics research community. That's a very high penetration rate in robotics research: everybody has a Baxter or a friend with a Baxter.

So we're starting something we're calling the Million Object Challenge. The goal is to enlist all of those Baxters, which are sitting around doing nothing a lot of the time, to change this data equation. What we're doing is trying to get everybody to scan objects for us, so that we can get models-- perceptual models, visual models, and also manipulation experiences with these objects-- to try to train new and better category models.
And I think even existing algorithms may work way better simply because they have better data. But I think it also opens the door to thinking about better models that we maybe couldn't even think about before, because we just didn't have the data to play with.

Where we are right now is that we've installed our stack at MIT on Daniela Rus's Baxter-- that's this one. We went down to Yale a couple of weeks ago, to Scaz's lab, and we have our software on their Baxter. We're going to Rethink tomorrow; they're going to give us three Baxters that we're going to play with and install there. And I have a verbal yes from WPI. A few other people have been interested-- I pitched this at RSS, so a lot of people have said they were interested; I don't know if that will actually translate to robot time. Our goal is to get about 500 or 1,000 objects across these sites-- four sites including us, so Rethink, Yale, MIT, and us, plus WPI if they get on board-- and then do a larger press release about the project, advertise it, and push all of our friends with Baxters to help us scan. And then have yearly scan-a-thons, where you download the latest software and spend a couple of days scanning objects for the glory of robotics, or something. And really try to change this data equation for the better, so we can manipulate lots of things.

So that's our plan for making robots that can robustly perform actions in real-world environments. More generally, I imagine a mobile robot walking around your house at night and scanning stuff completely autonomously: taking these pictures, building these models, hopefully not breaking too much of your stuff. And not only learning about your particular house and the things that are in it, but also collecting data that will enable other robots to perform better over time. All right, so that's our attack on making robots robustly perform actions in real-world environments.
The next problem that I think is important for language understanding and human-robot collaboration is making robots that can carry out complex sequences of actions. For example, this is that pantry again. There might be hundreds or thousands of objects that the robot could potentially manipulate, and it might need to do a sequence of 10 or 20 manipulations in order to solve a problem such as "clean up the kitchen" or "put away the groceries."

For work that I had done in the past on the forklift, a lot of the commands that we studied and thought about were at the level of abstraction of "put the pallet on the truck." But one of our annotators-- we collected a lot of data on Amazon Mechanical Turk-- gave us this problem that I never forgot. He was an actual forklift operator who worked in a warehouse, and he said, if you pay me extra money, I'll tell you how to pick up a dime-- a dime, like a little coin-- with a forklift. Here are the instructions that he eventually gave us, without making us pay him, for how to solve this problem: raise the forks 12 inches, line up in front of the dime, tilt the forks forward, drive over a little bit, lower the fork on top of the dime, put it in reverse and travel backward, and the dime kind of flips up backwards on top of the fork. Maybe you don't know how to drive a forklift, but you can see how that would work. And if you did know how to drive a forklift, you could follow those instructions and have it happen.

But I knew that if we gave our system these commands, there is no way that it would work. It would completely fall apart. And the reason it would fall apart is that we gave the robot a model of actions at a different level of abstraction than this language is using. We gave it very high-level abstract actions, like picking things up, moving them to particular locations, and putting things down.
And if we gave it these low-level actions, like raising the forks 12 inches, the search that would be required to find a high-level thing like "put the pallet on the truck" would be prohibitively expensive. But the thing is, people don't like to stick to any fixed level of abstraction. People move up and down the tree freely; they give very high-level, mid-level, and low-level commands. So I think we need new planning algorithms that support this kind of thing.

To think about this, we decided to look at a version of the problem in simulation. The simulator that we chose is a game called Minecraft. Five minutes? OK. This is a picture from a Minecraft world, and we're trying to figure out new planning algorithms. The problem here is that the agent needs to cross the trench, so it needs to make a bridge to get across. It's got some blocks that it can manipulate, and you have this combinatorial explosion of where the blocks can go-- they can go anywhere. A naive algorithm will spend a lot of time putting the blocks everywhere, which doesn't really make progress toward solving the problem, whereas what you really need to do is focus on putting these blocks in the trench. Of course, on a different day you might be asked to make a tower or a castle or a staircase, and then those might be good things to do. So you don't just throw out those actions; you want to keep them all and figure out what to do based on your high-level goal.

We have some work about learning how to do this. We have an agent that practices solving small Minecraft problems and then learns how to solve bigger problems from experience. This is showing transfer from small problems to big problems in a decision-theoretic framework, an MDP framework.
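For reference, here is a minimal sketch of the baseline MDP machinery being described, as plain tabular Q-learning over a toy problem like trench-crossing. It is illustrative only: it is not BURLAP (a Java framework) and not the transfer method from this work, and the step and is_goal hooks are hypothetical stand-ins for a small simulator.

```python
import random
from collections import defaultdict

def q_learning(actions, step, is_goal, start, episodes=500, max_steps=200,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning over hashable states (illustrative sketch).

    step(state, action) -> (next_state, reward); is_goal(state) -> bool.
    In a trench-crossing toy world, only block placements inside the trench
    eventually lead to reward, so a naive agent wastes most of its steps.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if is_goal(s):
                break
            # Epsilon-greedy over the (combinatorially large) action set.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r = step(s, a)
            # One-step Q-learning backup.
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

The point made above is exactly that this naive loop scales badly as block-placement actions multiply, which is why learning on small worlds and transferring what matters to larger ones helps.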
And a couple of weeks ago we released a mod for Minecraft called BurlapCraft. BURLAP is our reinforcement learning and planning framework that James MacGlashan and Michael Littman developed in Java. You can run BURLAP inside the Minecraft JVM, get the state of the real Minecraft world, make small toy problems if you want, or let your agent go in the real thing and explore the whole space of possible Minecraft worlds, if you're interested in that simulation.

OK, I'm almost out of time, so I'm not going to go too much into robots coordinating with people, but maybe I will show some of the videos about this work. The idea is that a lot of the previous work on language understanding works in batch mode: the person says something, the robot thinks for a long time, and then the robot does something-- hopefully the right thing. And as I said before, this is not how people work. So we're working on new models-- this is the graphical model; we were talking about it in the car-- that let the robot incrementally interpret language and gesture, updating at very high frequency.

This is showing the belief about which objects the person wants, updating from their language and gesture in an animated kind of way; it's updating at 14 hertz. So the robot has the information. This is from language on its own: "I would like a bowl," and both bowls go up. Then he points, and the one that he's pointing at goes up. So the robot knows very, very quickly: every time we get a new word from the [INAUDIBLE] condition, every time we get a new observation from the gesture system, we update our belief. And just a couple of weeks ago, we had our first pilot results showing that we can use this information to enable the robot to produce real-time feedback that increases the human's accuracy at getting the robot to select the right object.
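A minimal sketch of that kind of incremental belief update: the distribution over candidate objects is multiplied by a likelihood each time a new word or gesture observation arrives, then renormalized, which is cheap enough to run many times per second. The likelihood function below is a made-up placeholder, not the model from this work.

```python
def update_belief(belief, observation, likelihood):
    """One incremental Bayes update over candidate objects.

    belief: dict mapping object name -> probability.
    likelihood(observation, obj) -> P(observation | the person wants obj).
    """
    posterior = {obj: p * likelihood(observation, obj) for obj, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        return belief  # uninformative observation; keep the previous belief
    return {obj: p / total for obj, p in posterior.items()}

# Illustrative usage: the word "bowl" raises both bowls; a later gesture
# observation would then favor the bowl the person is pointing at.
belief = {"red_bowl": 0.25, "blue_bowl": 0.25, "spoon": 0.25, "knife": 0.25}

def word_likelihood(word, obj):
    return 0.9 if word in obj else 0.1

belief = update_belief(belief, "bowl", word_likelihood)
```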
There are some quantitative results here, but I'll skip them.

OK, so those are the three main thrusts that I'm working on in my research group: trying to make robots that can robustly perform actions in real-world environments; thinking about planning in the really large state-action spaces that result when you have a capable and powerful robot; and then thinking about how you can make the robot coordinate with people, so that they can figure out what to do in those really large state-action spaces. Thank you.