The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CARLO CILIBERTO: Good morning. Today we will give a bit of an overview of the iCub robot. It will take about an hour, an hour and a half, and we have organized the schedule as a series of four short talks followed by a live demo on the iCub. I will give you an overview of the fields and capabilities that the iCub has developed so far, while Alessandro, Raffaello, and Giulia will show you part of what is going on right now on the robot; they are going to talk about their recent work.

So let's start with the presentation. This is the iCub, a child-sized humanoid robot, about this size. The iCub project began in 2004, and the iCub, or rather the iCubs, because there are many of them, were built in Genoa at the Italian Institute of Technology. The main motivation behind the creation and design of this platform was to have a way to study how intelligence and cognition emerge in artificial embodied systems. Giulio Sandini and Giorgio Metta, whom you can see there, are the original founders of the iCub world, and they are both directors at IIT.

This is a bit of a timeline that I drew; actually, many other things have been going on during these 11 years. This is a video celebrating the first 10 years of the project, and in it you can see many more things that the iCub is able to do. I selected this part because I think it can be useful, if you are interested in doing projects with the robot, to have an idea of the kinds of skills and feedback the robot can provide you for an experiment.
As I told you, the iCub was built with the idea of creating an artificial embodied system that could explore the environment and learn from it, so it has many different sensors; these are some of them. In the head we have an accelerometer and a gyroscope that provide inertial feedback to the system, and two Dragonfly cameras that provide medium-resolution images. As you can see there, the robot is about one meter tall, a bit more, it is fairly light, 55 kilograms, and it has a lot of degrees of freedom: 53 of them, which allow it to perform many complicated actions. It is equipped with force and torque sensors, which I will go over in a minute. Its whole body, or at least the covered part of the robot, the black parts you can see there, is covered in artificial skin, which provides feedback about contact with the external world. It also has microphones mounted on the head, but for sound and speech recognition it is currently better to use a direct microphone, because of noise-cancellation problems and so on; if you are interested in speech or sound feedback, we use other kinds of microphones.

During these 11 years the iCub has been involved in many, many projects, and indeed part of what I'm going to show you is the result of the joint effort of many labs, mainly in Europe, since these are mostly European projects. The iCub group is also an international partner of the CBMM project.

Regarding the force/torque sensors, they are the sensors you can see there. They measure forces and torques, they are mounted in each of the limbs of the robot and in the torso, and they allow the robot to sense its interaction with the world. Indeed, with this kind of feedback it can do many different things.
For instance, in this video I'm showing an example of how the feedback provided by the force/torque sensors can be used to guide the robot and teach it different kinds of actions, in this case a pouring action, which it can then repeat and perhaps generalize.

Force/torque sensors give the robot feedback about the intensity of its interaction with the external world, but they do not let it know where this interaction is occurring. For that, we have the artificial skin covering the robot, as I told you. The technology used for it, which you can see here for the palm of the robot's hand, is capacitive, similar to the technology used in smartphone touchscreens. The yellow dots you can see are electrodes that, together with a second conductive layer, form capacitors. When there is an interaction with the environment, the change in capacitance provides feedback about the intensity of the contact as well as its location: it tells the robot where the interaction is occurring.

The artificial skin is really useful for an embodied agent, for reasons like the one you see in this video. Without this feedback, if you have a very light object, the robot is not able to detect that it is interacting with something: it just keeps closing the hand, gets no feedback, and crushes the object. By using the sensors on the fingertips, the robot can detect that it is actually touching something and stop the action without crushing the object.

Here are other useful things that can be done with the artificial skin. This is an example of combining the information from the force/torque sensors and the skin: the skin detects where the force is applied to the robot and in which direction. And in that case, the robot is counterbalancing.
It is negating the effect of gravity and of its own internal forces, so the arm basically floats around as if it were in space with no friction. By touching the arm of the robot, we make it drift in the direction opposite to where the force, or torque, is applied. You can see that the arm is turning, but, as you can see, it behaves as if it were in space without any friction, because the controller is cancelling both gravity and the internal forces.

Finally, still about the artificial skin, there is some work by Alessandro. He is going to talk about something else, but I find it particularly interesting to show. This is an example of the robot self-calibrating the model of its own body with respect to its own hand. The idea is to have the robot use the tactile feedback from the fingertip and from the skin of its forearm, for instance, to learn the spatial relation between the fingertip and the arm. First, by touching itself, it learns the correspondence; then it shows that it has learned this correlation by reaching for the same point when someone else touches it. This can be seen as a way of self-calibrating without the need for a kinematic model of the robot: the robot simply explores itself and learns how the different parts of its body relate to one another.

Again related to self-calibration, but this time a calibration between vision and motor activity, is a work that appeared in 2014 in which the actions the robot is able to perform are calibrated with respect to its ability to perceive the world. In the video I'm going to show, the robot tries to reach for an object and fails, because the model of the world that it uses to perform the reaching is not aligned with the 3-D model provided by vision.
This can happen due to small errors in the kinematics or in the vision, and even small errors can cause a complete failure of the system. So the robot tries to correlate the two sources of information: in this case it looks at its own fingertip to see where it actually is in the image, and compares that with where the kinematic model predicts the hand should be. The green dot is the point where the kinematics predicts the fingertip to be, and you also see where the fingertip actually is. By learning this relation, the robot is able to cope with the misalignment, and after a first calibration phase it can perform the reaching action successfully, as you will see in a moment. This kind of calibration ability would also be quite useful in situations where the robot is damaged and its model therefore changes completely. As you can see, now it reaches and performs the grasp correctly.

Finally, before moving on to the other talks, I'm going to show a last video, about balancing. Some of you have asked whether the robot walks. It is currently not able to walk, but this is a video from the group that is in charge of making the robot walk, and the first step, of course, is balancing. This is an example of it: one-foot balancing, where multiple components of what I have shown you about the iCub so far, torque sensing and inertial sensing, are combined to have the robot stand on one foot and also cope with external forces that could make it fall. They are applying forces to the robot, and it is able to detect them and to cope with them while staying stable.

OK, so this was just a brief overview of some of the things that can be done with the iCub. The next talks will say a bit more about what is going on with it right now.
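Carlo does not detail the implementation of that visuo-motor calibration, but the basic idea, learning a mapping between where the kinematic model predicts the fingertip and where vision actually observes it and then correcting reaching targets through that mapping, can be sketched in a few lines. Everything below, the data, the affine form of the correction, and the shared reference frame, is an illustrative assumption rather than the method actually used on the iCub.

```python
import numpy as np

# Hypothetical calibration data: for several test poses we record where the
# kinematic model predicts the fingertip to be and where vision actually sees
# it (both expressed in the same 3-D reference frame). These arrays are
# placeholders, not robot measurements.
p_kin = np.array([[0.30, 0.10, 0.05],
                  [0.28, 0.05, 0.10],
                  [0.35, 0.12, 0.02],
                  [0.32, 0.00, 0.08],
                  [0.29, 0.08, 0.12]])
p_vis = p_kin + np.array([0.02, -0.01, 0.015])  # stand-in for real observations

# Fit an affine correction  p_vis ~ A @ p_kin + b  by ordinary least squares.
X = np.hstack([p_kin, np.ones((len(p_kin), 1))])   # append a constant column
W, *_ = np.linalg.lstsq(X, p_vis, rcond=None)      # W stacks A^T and b (4 x 3)

def correct(p):
    """Map a kinematic prediction into the visual frame with the learned correction."""
    return np.append(p, 1.0) @ W

# A reaching target computed from the kinematic model can then be adjusted
# before the motion is executed, compensating the systematic misalignment.
print(correct(np.array([0.31, 0.06, 0.07])))
```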
ALESSANDRO RONCONE: I want to talk to you about part of my PhD project, which was about tackling the perception problem through the use of multisensory integration. Specifically, I narrowed this big problem down by implementing a model of peripersonal space on the iCub, which is a biologically inspired approach. Peripersonal space is a concept that has been known in neuroscience and psychology for years, so let me start with what peripersonal space is and why it is so important for humans and animals. It is defined as the space around us within which objects can be grasped and manipulated. It is basically an interface between our body and the external world, and for this reason it benefits from a multimodal, integrated representation that merges information from different modalities; historically these have been the visual system, the tactile system, proprioception, the auditory system, and even the motor system.

Historically, it has been studied by two different fields: neurophysiology on one side, and psychology and developmental psychology on the other. They followed two different approaches, the former bottom-up, the latter top-down, and they came out with different outcomes. The former emphasized the role of perception and its interplay with the motor system in the control of movement, whereas the latter focused mainly on the multisensory aspect, that is, how different modalities are combined in order to form a coherent view of the body and the nearby space. Luckily, in recent years they have converged to a common ground and a shared interpretation, and for the purposes of my work I would like to highlight the main aspects. Firstly, and this might be of interest from an engineering perspective, peripersonal space is made of different reference frames that are located in different regions of the brain.
And there might be a way for the brain to switch from one to another according to the context and the goal. Secondly, as I was saying, peripersonal space benefits from multisensory integration in order to form a coherent view of the body and the surrounding space. In this experiment by Fogassi in 1996, they found a number of so-called visuo-tactile neurons, that is, neurons that fire both when a specific skin region is stimulated and when an object is presented in the surrounding space. This means that these neurons code both visual and tactile information, but they also carry some proprioceptive information, because they are basically attached to the body part they belong to. Lastly, one of the main properties of this representation is its plasticity. For example, in this experiment by Iriki and colleagues, the extension of these receptive fields into the visual space, the surrounding space, was shown, after training with a rake, to grow until it enclosed the tool, as if the tool had become part of the body. So through experience and tool use, the monkey was able to grow this receptive field.

These are very nice properties, and we would like them to be available to the robot. In robotics, the work related to peripersonal space can be divided into two groups. On one side there are the models and simulations; the closest one to my work is the one from Fuke and colleagues from [INAUDIBLE] lab, in which they used a simulated robot to model the mechanisms that lead to this visuo-tactile representation. On the other side there are the engineering approaches, which are few. The closest one is this work by Mittendorfer from Gordon Cheng's lab, in which they first developed a multimodal skin, that is, the hardware needed to do this, and then used it to trigger local avoidance responses, reflexes to incoming objects.
We are trying to position ourselves in the middle. Let's say we are not trying to create a perfect model of peripersonal space from a biological perspective, but on the other hand we would like to have something that actually works and is useful for our purposes. From now on, I will divide the presentation into two parts. The first will be about the model, what we think is useful for tackling the problem; in the second I will show you an application of this model, which basically uses this local representation to trigger avoidance or reaching responses distributed throughout the body.

So let me start with the proposed model of peripersonal space. Loosely inspired by the neurophysiological findings we discussed before, we developed this peripersonal space representation by means of spatial receptive fields that extend out from the robot's skin, basically extending the tactile domain into the nearby space. Each taxel, that is, each of the tactile elements the iCub skin is composed of, experiences a set of multisensory events: you let the robot learn these visuo-tactile associations by taking an object and making contact with a skin part. Through tactile experience, each taxel learns a sort of probability of being touched, which activates prior to contact when a new incoming object is presented. We created a cone-shaped receptive field extending from each of the taxels, and for any object entering this receptive field we keep what we call a buffer of its path, so the idea is that the robot retains some information about what was going on before the touch, the actual contact. If the object eventually ends up touching the taxel, the event is labeled as positive, and it reinforces the probability of objects approaching in that way ending up touching the taxel. If not, for example, it might be that the object enters this receptive field and, in the end, ends up touching another taxel.
This will be labeled as a negative event. So in the end, each taxel has a set of positive and negative events it can learn from. The space is three dimensional, because the distance to the object is three dimensional, but we narrowed it down to a one-dimensional domain by taking the norm of the distance, while also accounting for the relative position of the object and the taxel, in order to cope with calibration errors that amounted to up to a couple of centimeters, which is significant.

This one-dimensional variable is discretized into a set of bins, and for each bin we compute the probability of an event falling in that bin ending in touch. The intuitive idea is that at 20 centimeters the probability of being touched is lower than at zero centimeters. On top of this one-dimensional representation we used a Parzen window interpolation technique, which provides a smooth function that, in the end, gives an activation value that depends on the distance of the object. So as soon as a new object enters the receptive field, the taxel fires before being contacted.

We did basically two sets of experiments. Initially we ran a simulation in MATLAB in order to assess the convergence of the learning in the long term, the one-shot learning behavior, and whether our model was able to cope with noise and with the calibration errors I mentioned. Then we went on the real robot, presented it with different objects, and basically touched the robot a hundred times in order to make it learn these representations. Trust me, I don't want to bother you with these technicalities, but we did a lot of work; this is, basically, the math behind the results.

So let me move on to the second part, in which the main problem was for the robot to detect the object visually. To do that, we developed a 3D tracking algorithm that was able to track a [INAUDIBLE] object.
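Before turning to that visual pipeline, the per-taxel learning scheme just described can be summarized in a minimal sketch. The bin layout, the Gaussian smoothing standing in for the Parzen window interpolation, and the class interface below are illustrative assumptions, not the actual iCub implementation.

```python
import numpy as np

class TaxelModel:
    """Illustrative per-taxel model: probability of eventual contact as a
    function of object distance, learned from positive/negative approach events."""

    def __init__(self, max_dist=0.2, n_bins=8, sigma=0.02):
        self.edges = np.linspace(0.0, max_dist, n_bins + 1)
        self.centers = 0.5 * (self.edges[:-1] + self.edges[1:])
        self.pos = np.zeros(n_bins)   # approaches that ended in contact with this taxel
        self.neg = np.zeros(n_bins)   # approaches that did not
        self.sigma = sigma            # smoothing width (assumption)

    def update(self, distances, touched):
        """distances: object distances recorded while it was inside the receptive
        field (the 'buffer'); touched: whether this taxel was eventually hit."""
        idx = np.clip(np.digitize(distances, self.edges) - 1, 0, len(self.centers) - 1)
        counts = np.bincount(idx, minlength=len(self.centers))
        if touched:
            self.pos += counts
        else:
            self.neg += counts

    def activation(self, d):
        """Smooth, Parzen-window style interpolation of the per-bin contact
        probabilities, evaluated at distance d."""
        total = self.pos + self.neg
        p = np.where(total > 0, self.pos / np.maximum(total, 1), 0.0)
        w = np.exp(-0.5 * ((d - self.centers) / self.sigma) ** 2)
        return float(np.sum(w * p) / np.sum(w))

taxel = TaxelModel()
taxel.update(np.array([0.15, 0.10, 0.05, 0.01]), touched=True)   # one positive event
taxel.update(np.array([0.18, 0.16, 0.14]), touched=False)        # one negative event
print(taxel.activation(0.03))   # fires before contact for close objects
```

Each taxel would maintain one such model, updated from the buffered approach trajectories and queried to produce the pre-contact activations shown in the demo.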
To build this tracker, we used software that was already available in the iCub software repository, which provides some basic algorithms you can play with. Namely, we used a two-dimensional optical flow module made by Carlo, a 2D particle filter, and a 3D stereo vision algorithm, basically the same one shown before during the recognition demo. All of this was feeding a Kalman filter that provides a robust estimate of the position of the object. The idea is that the motion detected by the optical flow acts as a trigger for the subsequent pipeline: after consistent enough motion is seen by the optical flow module, a template is extracted and tracked in the 2D visual field by the particle filter; this information is then sent to the 3D depth map, and that in turn feeds the Kalman filter, which gives us a stable representation, because obviously the stereo system does not work that well in our context.

And this, if it works... no. OK, on my laptop it works. OK, now it works. This is the idea: I was basically waving, moving the object in the beginning, and then, when it is detected, the pipeline starts. You can see the tracking here, this is the stereo vision, and this is the final outcome. This was used for the learning: we did a lot of repetitions with these objects approaching the skin on different body parts. This is the graph; I don't want to talk about that, so let me start the video. This is basically the skin, and this is the part that was trained beforehand. When there is a contact, there is an activation here; you can see the activation. And soon after, and this worked even with a single example, the taxel starts firing before the contact. Obviously, this improves over time.
And it depends on the body part that is touched. For example, if I touch here, coming from the top, the representation starts firing mainly here, and this obviously depends on the specific body part. Now I think I'm going to touch the hand, so after a while you will see an activation on the hand; obviously there is also some activation on the forearm, because I was getting closer to the forearm as well.

As for applications: by itself, this is simply a representation, so it is not directly usable. We exploited it in order to develop an avoidance behavior, a margin of safety around the body. Let's say that if a taxel is firing, I would like that body part to move away from the object, assuming it could be potentially harmful. Conversely, I would also like the robot to be able to reach for the object under consideration with any body part. To this end, we developed an avoidance and catching controller that leverages this distributed information and performs a sensor-based guidance of the motor actions by means of these visuo-tactile associations.

This is basically how it works. This is the testing stage, so the representation has already been learned. As soon as I get closer, the taxels start firing, because of the probabilities that were learned, and the arm goes away. Obviously, the movement depends on the specific skin part being approached: if I'm approaching here, the arm will move away from here; if I'm coming from the top, and I think this one was from the top, yes, the arm will move away toward the back, away in another direction. The idea here is not to tackle the problem with a classical robotics approach; rather, this behavior emerges from the learning. And the idea was very simple: we looked at the taxels that were firing, and if they were firing enough, we recorded their positions.
And we did, basically, a population coding, that is, a weighted average according to the activation, the predicted probability of contact. We did that both for the positions of the taxels and for their normals, so in the end, if you have a bunch of taxels firing here, you end up with a single point to move away from. The catching, the reaching, is basically the same but in the opposite direction: if I want to avoid, I go this way; if I want to catch, I go that way. Obviously, if you do it with the hand, this is standard robotic reaching, but it can also be triggered on different body parts. As you can see here, I first get a virtual activation and then the physical contact. And, basically, our design uses the same controller for both behaviors. OK, these are some technicalities I don't want to show you.

In conclusion, the work presented here is, to our knowledge, the first attempt at creating a decentralized, multisensory, visuo-tactile representation of a robot's body and its nearby space by means of the distributed skin and of interaction with the environment. One of the assets of our representation is that learning is fast; as you saw, it can learn even from one single example. It is parallel across the whole body, in the sense that every taxel learns its own representation independently. It is incremental, in the sense that it converges toward a stable representation over time. And, importantly, it adapts from experience, so it can automatically compensate for errors in the model, which, for humanoid robots, is one of the main problems when merging different modalities. OK, thank you. If you have any questions, feel free to ask.

RAFFAELLO CAMORIANO: I am Raffaello.
Today I'll talk to you about a little bit of my work on machine learning and robotics, in particular about some subfields of machine learning, namely large scale learning and incremental learning. But what do we expect from a modern robot, and how can machine learning help with this? Well, we expect modern robots to work in unstructured environments which they have never seen before, and to learn new tasks on the fly, depending on the particular needs arising throughout the operation of the robot itself, and across different modalities: vision, of course, but also tactile sensing, which is available on the iCub, and proprioceptive sensing, including force sensing, [INAUDIBLE] and so on. And we want to do all of this over a potentially very long time span, because we expect robots to be companions of humans in the real world, operating for maybe years or more. This poses a lot of challenges, especially from the computational point of view, and machine learning can actually help in tackling them.

For instance, there are large scale learning methods, algorithms which can work with very large datasets. If we have millions of points gathered by the robot's cameras over ten days and we want to process them, using standard machine learning methods would make that a very difficult problem to solve unless we use, for instance, randomized methods and so on. Machine learning also has incremental algorithms, which allow the learned model to be updated as new, previously unseen data are presented to the agent. And there is also the subfield of transfer learning, which allows knowledge learned for a particular task to be reused for another related task without the need to see many new examples of the new task.

So my main research focus is machine learning.
I work especially on large scale learning methods, incremental learning, and the design of algorithms which allow for trade-offs between computation and accuracy; I will explain this a bit more later. As concerns robotic applications, I work with Giulia, Carlo, and others on incremental object recognition, a setting in which the robot is presented with new objects over a long time span and has to learn them on the fly. I'm also working on a system identification problem, which I will explain later, related to the motion of the robot.

This is one of the works which has occupied my last year, and it is related to large scale learning. If we consider that we may have a very large n, the number of examples we have access to, then in the setting of kernel methods we may have to store a huge matrix, the kernel matrix K, which is n by n and could simply be impossible to store. There are randomized methods, like the Nystrom method, which make it possible to compute a low-rank approximation of the kernel matrix simply by drawing a small number m of samples at random and building the matrix K_nm, which is much smaller, because m is much smaller than n. This is a well-known method in machine learning, but we tried to see it from a different point of view than usual. Usually it is seen just from a computational standpoint, as a way to fit a difficult problem into computers with limited capabilities, while we proposed to see the Nystrom approximation as a regularization operation in itself. The usual way in which the Nystrom method is applied, for instance with kernel regularized least squares, is that the parameter m, the number of examples taken at random, is chosen as large as possible, just so that the problem still fits in the memory of the available machines; and then, after choosing a large m, it is often necessary to add explicit regularization on top of it anyway, for instance the usual Tikhonov penalty.
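To make the two knobs concrete, a minimal sketch of subsampled kernel regularized least squares might look like the following; the Gaussian kernel, the synthetic data, and the small numerical jitter are illustrative assumptions, not the code used in the actual work.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_krls(X, y, m, lam, sigma=1.0, rng=np.random.default_rng(0)):
    """Subsampled (Nystrom) kernel regularized least squares: only an n x m
    kernel block is ever formed, so m controls the computational cost."""
    n = X.shape[0]
    centers = X[rng.choice(n, size=m, replace=False)]
    Knm = gaussian_kernel(X, centers, sigma)            # n x m
    Kmm = gaussian_kernel(centers, centers, sigma)      # m x m
    # Solve (Knm^T Knm + n*lam*Kmm) alpha = Knm^T y
    A = Knm.T @ Knm + n * lam * Kmm
    alpha = np.linalg.solve(A + 1e-10 * np.eye(m), Knm.T @ y)
    return centers, alpha

def predict(Xtest, centers, alpha, sigma=1.0):
    return gaussian_kernel(Xtest, centers, sigma) @ alpha

# Toy usage with synthetic data (placeholders, not robot data).
X = np.random.default_rng(1).uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(2).standard_normal(2000)
centers, alpha = nystrom_krls(X, y, m=50, lam=1e-6)
print(predict(np.array([[0.5]]), centers, alpha))
```

In this standard form both m and the explicit penalty lam appear; the observation described next is that shrinking m already acts as a regularizer, so lam can effectively be dropped and m tuned instead.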
And this sounds a bit like a waste of time and memory, because what regularization, roughly speaking, does is discard the irrelevant eigencomponents of the kernel matrix. We observed that we can obtain the same effect by simply taking fewer random examples, so having a smaller model which can be computed more efficiently and without having to regularize again afterwards. So m, the number of examples used, controls both the regularization and the computational complexity of our algorithm, which is very useful in a robotic setting where we have to deal with lots of data.

As regards the incremental object recognition task, this is another project I'm working on. Imagine that the robot has to work in an unknown environment and is presented with novel objects on the fly; it has to update its object recognition model in an efficient way, without retraining from scratch every time a new object arrives. This can be done fairly easily with a slight modification of the regularized least squares algorithm and a proper reweighting. An open question is how to change the regularization as n grows, because we have not yet found a way to efficiently update the regularization parameter in this case, so we are still working on it.

The last project I'll talk about is more related to physics and motion. We take an arbitrary limb of the robot, for instance the arm, and our task is to learn an inverse dynamics model, that is, a model that can predict the internal forces and torques of the arm during motion. This is useful, for instance, in a contact detection setting: when the sensor readings differ from the predicted ones, that may mean there is a contact. It is also useful for external force estimation or, for example, for identifying the mass of a manipulated object. We had some challenges in this project.
We had to devise a model that is interpretable, in which the rigid body dynamics parameters remain understandable and intelligible for control purposes. We wanted the model to be more accurate than a standard rigid body dynamics model, and we also wanted it to adapt to conditions that change over time: for instance, during the operation of the robot, after one hour the change in temperature also changes the mechanical and dynamic properties of the arm, and we want to accommodate this in an incremental way.

So this is what we did. We implemented a semi-parametric model in which the first part, which acts as a prior, is a simple incremental parametric model, and then we used random features to build a non-parametric incremental model which can be updated efficiently. We showed in real experiments that the semi-parametric model works as well as the non-parametric one but converges faster, because it has initial knowledge about the physics of the arm, and that it is also better than the fully parametric one, because it also models, for example, dynamical effects due to the flexibility of the body, which are usually not captured by rigid body dynamics models.

Another thing I'm doing is maintaining the Grand Unified Regularized Least Squares (GURLS) library, which is, of course, a library for regularized least squares. It supports large scale datasets. It was developed in a joint exchange between MIT and IIT some years ago, by others, not by me, and it has a MATLAB and a C++ interface. If you want to have a look at how these methods work, I suggest you try out the tutorials available on GitHub.

GIULIA PASQUALE: I'm Giulia, and I work on the iCub robot with my colleagues, especially on vision and, in particular, on visual recognition.
I work under the supervision of Lorenzo Natale and Lorenzo Rosasco; both will be here for a few days in the following weeks. The work I'm going to present has been done in collaboration with Carlo and also with Francesca Odone from the University of Genoa.

In the last couple of years, computer vision methods based on deep convolutional neural networks have achieved remarkable performance in tasks such as large scale image classification and retrieval. The extreme success of these methods is mainly due to the increasing availability of ever larger datasets, and in particular I'm referring to ImageNet, which is composed of millions of examples labeled into thousands of categories through crowdsourcing platforms such as Amazon Mechanical Turk. The increased data availability, together with the increased computational power, has made it possible to train deep networks with millions of parameters in a supervised way, from the image up to the final label, through the backpropagation algorithm. This marked a breakthrough, in particular in 2012, when Alex Krizhevsky proposed for the first time a network of this kind trained on the ImageNet dataset and decisively won the ImageNet Large Scale Visual Recognition Challenge with it. The trend has been confirmed in the following years, so that nowadays problems such as large scale image classification or detection are usually tackled with this deep learning approach. And not only that: it has also been demonstrated, at least empirically-- oh, I'm sorry, maybe this is not particularly clear, but this is the Krizhevsky network-- that models of this kind, trained on large datasets such as ImageNet, also provide very good, general, and powerful image descriptors that can be applied to other tasks and datasets.
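Giulia expands on this off-the-shelf use of a pretrained network next; as a rough sketch of the idea, one can load an ImageNet-pretrained model and read out an intermediate layer as an image descriptor. The snippet below uses torchvision's AlexNet purely as a modern stand-in for the Krizhevsky model actually used in this work, and the image file name is a hypothetical crop saved from the robot's camera.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# An AlexNet pretrained on ImageNet, used here as a black-box feature extractor.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()

# Keep everything up to the second-to-last fully connected layer: its output,
# a 4096-dimensional vector, is the kind of intermediate-layer descriptor
# mentioned above.
feature_extractor = torch.nn.Sequential(
    net.features, net.avgpool, torch.nn.Flatten(), *list(net.classifier[:-1]))

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

img = Image.open("crop.jpg").convert("RGB")   # hypothetical crop from the robot
with torch.no_grad():
    feat = feature_extractor(preprocess(img).unsqueeze(0))
print(feat.shape)                              # torch.Size([1, 4096])
```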
770 00:45:51,500 --> 00:45:54,110 In particular, it is possible to use 771 00:45:54,110 --> 00:46:00,000 a convolutional neural network trained on the ImageNet dataset, 772 00:46:00,000 --> 00:46:04,190 feed it with images, and use it 773 00:46:04,190 --> 00:46:08,660 as a black box, extracting the vectorial representation 774 00:46:08,660 --> 00:46:11,810 of the incoming images as the output of one 775 00:46:11,810 --> 00:46:14,600 of the intermediate layers. 776 00:46:14,600 --> 00:46:18,640 Or even better, it is possible to start from a network 777 00:46:18,640 --> 00:46:21,410 model trained on the ImageNet dataset 778 00:46:21,410 --> 00:46:27,290 and fine-tune its parameters on a new dataset for a new task, 779 00:46:27,290 --> 00:46:31,940 achieving and even surpassing the state of the art-- 780 00:46:31,940 --> 00:46:37,400 for example, also on the PASCAL dataset and other tasks-- 781 00:46:37,400 --> 00:46:38,600 following this approach. 782 00:46:41,330 --> 00:46:48,240 So it is natural to ask at this point why, 783 00:46:48,240 --> 00:46:53,630 instead, in robotics, providing robots 784 00:46:53,630 --> 00:46:58,610 with robust and accurate visual recognition capabilities 785 00:46:58,610 --> 00:47:03,110 in the real world is still one of the greatest challenges that 786 00:47:03,110 --> 00:47:06,260 prevents the use of autonomous agents 787 00:47:06,260 --> 00:47:10,030 in concrete applications. 788 00:47:10,030 --> 00:47:14,540 And actually, this is a problem that is not only related 789 00:47:14,540 --> 00:47:20,190 to the iCub platform, but it is also a limiting factor 790 00:47:20,190 --> 00:47:24,470 for the performance of the latest 791 00:47:24,470 --> 00:47:27,050 robotics platforms, such as the ones that 792 00:47:27,050 --> 00:47:30,880 have been participating, for example, in the DARPA Robotics 793 00:47:30,880 --> 00:47:32,980 Challenge. 794 00:47:32,980 --> 00:47:40,700 Indeed, as you can see here, robots 795 00:47:40,700 --> 00:47:48,500 are still either highly tele-operated, or complex 796 00:47:48,500 --> 00:47:50,491 methods-- 797 00:47:50,491 --> 00:47:55,520 to, for example, map the 3D structure of the environment 798 00:47:55,520 --> 00:48:00,080 and label it a priori-- must be implemented in order 799 00:48:00,080 --> 00:48:03,080 to enable autonomous agents to act in very 800 00:48:03,080 --> 00:48:06,650 controlled environments. 801 00:48:06,650 --> 00:48:13,460 So we decided to focus on very simple settings 802 00:48:13,460 --> 00:48:15,750 where, in principle, computer vision 803 00:48:15,750 --> 00:48:20,030 methods such as the ones that I've been describing to you 804 00:48:20,030 --> 00:48:23,480 should at least-- 805 00:48:23,480 --> 00:48:25,550 well, should provide very good performance, 806 00:48:25,550 --> 00:48:28,420 because here the setting is pretty simple. 807 00:48:28,420 --> 00:48:31,310 And we tried to evaluate the performance 808 00:48:31,310 --> 00:48:34,290 of these deep learning methods in these settings. 809 00:48:34,290 --> 00:48:37,260 Here you can see the robot, that one, 810 00:48:37,260 --> 00:48:39,740 standing in front of a table. 811 00:48:39,740 --> 00:48:43,610 There is a human who gives verbal instructions 812 00:48:43,610 --> 00:48:48,270 to the robot and also, for example in this case, 813 00:48:48,270 --> 00:48:54,470 the label of the object to be either learned or recognized.
814 00:48:54,470 --> 00:49:00,830 And the robot can focus its attention on potential objects 815 00:49:00,830 --> 00:49:03,920 through bottom-up segmentation techniques-- for example, 816 00:49:03,920 --> 00:49:08,830 in this case, color or other saliency-based segmentation 817 00:49:08,830 --> 00:49:09,690 methods. 818 00:49:09,690 --> 00:49:12,230 I'm not going into the details of this setting, 819 00:49:12,230 --> 00:49:16,760 because you will see a demo of it after my talk. 820 00:49:16,760 --> 00:49:20,150 Another setting that we are considering 821 00:49:20,150 --> 00:49:22,130 is similar to the previous one. 822 00:49:22,130 --> 00:49:25,800 But this time, there is a human standing in front of the robot. 823 00:49:25,800 --> 00:49:27,380 And there is no table. 824 00:49:27,380 --> 00:49:31,370 And the human is holding the objects in his hands 825 00:49:31,370 --> 00:49:35,150 and is showing one object after the other 826 00:49:35,150 --> 00:49:38,240 to the robot, providing the verbal annotation 827 00:49:38,240 --> 00:49:40,800 for that object. 828 00:49:40,800 --> 00:49:45,260 In this way the robot, for example 829 00:49:45,260 --> 00:49:49,250 here, can exploit motion detection techniques 830 00:49:49,250 --> 00:49:53,330 in order to localize the object in the visual field 831 00:49:53,330 --> 00:49:55,130 and focus on it. 832 00:49:55,130 --> 00:49:58,820 The robot tracks the object continuously, 833 00:49:58,820 --> 00:50:02,270 acquiring in this way crops of the frames 834 00:50:02,270 --> 00:50:06,230 around the object, which are the training examples that 835 00:50:06,230 --> 00:50:10,470 will be used to learn the object's appearance. 836 00:50:10,470 --> 00:50:17,060 So in general, this is the recognition pipeline 837 00:50:17,060 --> 00:50:22,610 that is implemented to perform both of the behaviors 838 00:50:22,610 --> 00:50:24,680 that I've been showing you. 839 00:50:24,680 --> 00:50:27,540 As you can see, the input is the image, 840 00:50:27,540 --> 00:50:31,220 the stream of images from one of the two cameras. 841 00:50:31,220 --> 00:50:34,190 Then there is the verbal supervision of the teacher. 842 00:50:34,190 --> 00:50:38,030 Then there are segmentation techniques 843 00:50:38,030 --> 00:50:41,690 in order to crop a region of interest 844 00:50:41,690 --> 00:50:46,550 from the incoming frame and feed this crop 845 00:50:46,550 --> 00:50:49,220 to a convolutional neural network. 846 00:50:49,220 --> 00:50:54,620 In this case, we are using the famous Krizhevsky model. 847 00:50:54,620 --> 00:50:58,310 Then we encode each incoming crop 848 00:50:58,310 --> 00:51:01,940 in a vector as the output of one of the last 849 00:51:01,940 --> 00:51:03,800 layers of the network. 850 00:51:03,800 --> 00:51:07,485 And we feed all these vectors to a linear classifier, 851 00:51:07,485 --> 00:51:09,860 which is linear because, in principle, the representation 852 00:51:09,860 --> 00:51:14,360 that we are extracting is good enough for the discrimination 853 00:51:14,360 --> 00:51:16,890 that we want to perform. 854 00:51:16,890 --> 00:51:19,250 And so the classifier uses these incoming vectors 855 00:51:19,250 --> 00:51:24,800 either as examples for the training set, 856 00:51:24,800 --> 00:51:33,950 or it assigns to each vector a predicted label. 857 00:51:33,950 --> 00:51:38,330 And the output is a histogram with the probabilities 858 00:51:38,330 --> 00:51:40,370 of all the classes.
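A minimal sketch of this last stage of the pipeline, with scikit-learn's logistic regression standing in for whatever linear model actually runs on the robot; the descriptors, labels, and class count below are placeholders, not data from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear classifier on top of the CNN descriptors, producing the per-class
# probability histogram described above.

train_X = np.random.randn(200, 4096)           # placeholder descriptors of training crops
train_y = np.random.randint(0, 5, size=200)    # placeholder labels given by the teacher

clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)

def classify(descriptor):
    """Return the probability histogram over classes and the winning label."""
    probs = clf.predict_proba(descriptor.reshape(1, -1))[0]
    return probs, int(np.argmax(probs))        # final outcome = class with highest probability
```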
859 00:51:40,370 --> 00:51:45,890 And the final outcome is the one with the highest probability. 860 00:51:45,890 --> 00:51:49,440 And the histogram is updated in real time. 861 00:51:49,440 --> 00:51:52,280 So this pipeline can be used for 862 00:51:52,280 --> 00:51:58,280 either of the two settings that I have described to you. 863 00:51:58,280 --> 00:52:05,330 So in particular, we started by trying 864 00:52:05,330 --> 00:52:13,490 to list some requirements that, according to us, 865 00:52:13,490 --> 00:52:17,510 are fundamental in order to implement 866 00:52:17,510 --> 00:52:22,730 a sort of ideal robotic visual recognition system. 867 00:52:22,730 --> 00:52:28,430 And these requirements are usually not considered 868 00:52:28,430 --> 00:52:31,730 by typical computer vision methods such as the ones that 869 00:52:31,730 --> 00:52:36,860 I have described to you, but they are all the same fundamental 870 00:52:36,860 --> 00:52:39,730 if we want to achieve human-level 871 00:52:39,730 --> 00:52:44,000 performance in the settings that I've been showing you. 872 00:52:44,000 --> 00:52:45,580 For example, first of all, the system 873 00:52:45,580 --> 00:52:47,980 should be, as you have seen, as much 874 00:52:47,980 --> 00:52:51,250 as possible self-supervised, meaning that there must 875 00:52:51,250 --> 00:52:53,530 be techniques in order to focus the robot's 876 00:52:53,530 --> 00:52:55,870 attention on the objects of interest 877 00:52:55,870 --> 00:52:59,140 and isolate them from the visual field. 878 00:52:59,140 --> 00:53:01,600 Then hopefully, we would like to come out 879 00:53:01,600 --> 00:53:04,210 with a system that is reliable and robust 880 00:53:04,210 --> 00:53:06,880 to variations in the environment 881 00:53:06,880 --> 00:53:10,830 and also in the objects' appearance. 882 00:53:10,830 --> 00:53:14,930 Then also, as we are in the real world, 883 00:53:14,930 --> 00:53:18,310 we would like a system able to exploit 884 00:53:18,310 --> 00:53:21,260 the contextual information that is available-- 885 00:53:21,260 --> 00:53:23,170 for example, the fact that we are actually 886 00:53:23,170 --> 00:53:24,190 dealing with videos. 887 00:53:24,190 --> 00:53:27,760 So the frames are temporally correlated. 888 00:53:27,760 --> 00:53:29,740 And we are not dealing with images in the wild, 889 00:53:29,740 --> 00:53:31,990 as in the ImageNet case. 890 00:53:31,990 --> 00:53:34,252 And finally, as Raffaello was mentioning, 891 00:53:34,252 --> 00:53:35,710 we would like to have a system that 892 00:53:35,710 --> 00:53:39,310 is able to learn incrementally, to build ever 893 00:53:39,310 --> 00:53:43,310 richer models of the objects through time. 894 00:53:43,310 --> 00:53:47,380 So we decided to evaluate this recognition pipeline 895 00:53:47,380 --> 00:53:51,060 according to the criteria that I have described to you. 896 00:53:54,280 --> 00:54:00,290 And in order to provide reproducibility for our study, 897 00:54:00,290 --> 00:54:03,370 we decided to acquire a dataset on which 898 00:54:03,370 --> 00:54:06,080 to perform our analysis. 899 00:54:06,080 --> 00:54:09,730 However, we would also like to be confident enough 900 00:54:09,730 --> 00:54:13,960 that the results that we obtain on our benchmark 901 00:54:13,960 --> 00:54:18,400 will also hold in the real usage of our system.
902 00:54:18,400 --> 00:54:20,050 And this is the reason why we decided 903 00:54:20,050 --> 00:54:24,160 to acquire our dataset in the same application setting where 904 00:54:24,160 --> 00:54:27,190 the robot usually operates. 905 00:54:27,190 --> 00:54:29,780 So this is the iCubWorld28 dataset 906 00:54:29,780 --> 00:54:32,780 that I acquired last year. 907 00:54:32,780 --> 00:54:35,620 As you can see, it's composed of 28 objects 908 00:54:35,620 --> 00:54:38,970 divided into seven categories, with four 909 00:54:38,970 --> 00:54:41,120 instances per category. 910 00:54:41,120 --> 00:54:43,860 And I acquired it on four different days 911 00:54:43,860 --> 00:54:46,720 in order to also test incremental learning 912 00:54:46,720 --> 00:54:49,560 capabilities. 913 00:54:49,560 --> 00:54:52,810 The dataset is available on the IIT website. 914 00:54:52,810 --> 00:54:54,740 And you can also use it, for example, 915 00:54:54,740 --> 00:54:58,547 for the projects of Thrust 5 if you are interested. 916 00:55:01,470 --> 00:55:03,820 And this is an example of the kind of videos 917 00:55:03,820 --> 00:55:09,190 that I acquired, considering one of the 28 objects. 918 00:55:09,190 --> 00:55:13,420 There are four videos for training, four for testing, 919 00:55:13,420 --> 00:55:16,960 acquired in four different conditions. 920 00:55:16,960 --> 00:55:21,010 The object is undergoing random transformations, mainly limited 921 00:55:21,010 --> 00:55:23,290 to 3D rotations. 922 00:55:23,290 --> 00:55:26,320 And as you can see, the difference between the days 923 00:55:26,320 --> 00:55:30,130 is mainly limited to the fact that we are just 924 00:55:30,130 --> 00:55:33,460 changing the conditions in the environment-- 925 00:55:33,460 --> 00:55:39,700 for example, the background or the lighting conditions. 926 00:55:39,700 --> 00:55:42,850 And we acquired eight videos for each of the 28 objects 927 00:55:42,850 --> 00:55:45,520 that I showed you. 928 00:55:45,520 --> 00:55:49,840 So first of all, we tried to find a measure, 929 00:55:49,840 --> 00:55:52,960 as I was saying before, to quantify 930 00:55:52,960 --> 00:55:56,200 the confidence with which we can expect 931 00:55:56,200 --> 00:55:59,590 that the results and the performance that we observe 932 00:55:59,590 --> 00:56:02,040 on this benchmark will also hold 933 00:56:02,040 --> 00:56:04,470 in the real usage of the system. 934 00:56:04,470 --> 00:56:08,470 And to do this, first of all, we focused only 935 00:56:08,470 --> 00:56:10,570 on object identification for the moment. 936 00:56:10,570 --> 00:56:14,650 So the task is to discriminate the specific instances 937 00:56:14,650 --> 00:56:19,180 of objects among the pool of 28. 938 00:56:19,180 --> 00:56:28,240 And we decided to estimate, for an increasing number of objects 939 00:56:28,240 --> 00:56:34,630 to be discriminated, from 2 to 28, the empirical probability 940 00:56:34,630 --> 00:56:40,870 distribution of the identification accuracy 941 00:56:40,870 --> 00:56:44,440 that we can observe statistically 942 00:56:44,440 --> 00:56:46,510 for a fixed number of objects. 943 00:56:46,510 --> 00:56:49,150 That is depicted here in the form of box plots. 944 00:56:51,740 --> 00:56:56,300 And also, we estimated, for each fixed number of objects 945 00:56:56,300 --> 00:57:00,800 to be discriminated, the minimum accuracy 946 00:57:00,800 --> 00:57:04,970 that we can expect to achieve with increasing confidence 947 00:57:04,970 --> 00:57:06,920 levels.
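A sketch of how this kind of estimate could be built; the experiment function is a placeholder, and the use of random subsets and percentiles is an assumption about the procedure, not a description of the exact protocol used in the talk.

```python
import numpy as np

# For each pool size k, draw many random subsets of k objects, run an
# identification experiment on each, and summarize the accuracies both as a
# distribution (the box plots) and as a low percentile, i.e. the minimum
# accuracy expected at a given confidence level.

def identification_accuracy(object_subset):
    """Placeholder for a real train/test run restricted to these objects."""
    return np.random.uniform(0.6, 1.0)

rng = np.random.default_rng(0)
n_objects, n_trials = 28, 200
datasheet = {}
for k in range(2, n_objects + 1):
    accs = np.array([identification_accuracy(rng.choice(n_objects, size=k, replace=False))
                     for _ in range(n_trials)])
    datasheet[k] = {
        "median": float(np.median(accs)),
        "min_at_90pct_confidence": float(np.percentile(accs, 10)),
    }
```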
948 00:57:06,920 --> 00:57:08,690 And this is a sort of data sheet. 949 00:57:08,690 --> 00:57:15,080 The idea is to give a hypothetical user 950 00:57:15,080 --> 00:57:18,440 of the robot an idea of the identification accuracy 951 00:57:18,440 --> 00:57:22,940 that can be expected given a certain pool of objects 952 00:57:22,940 --> 00:57:25,700 to be discriminated. 953 00:57:25,700 --> 00:57:28,110 So the second point that I'll briefly describe to you 954 00:57:28,110 --> 00:57:30,410 is the fact that we investigated the effect 955 00:57:30,410 --> 00:57:38,190 of having a more or less precise segmentation of the image. 956 00:57:38,190 --> 00:57:41,960 So we evaluated the task of identifying the 28 957 00:57:41,960 --> 00:57:47,090 objects with different levels of segmentation, 958 00:57:47,090 --> 00:57:50,455 starting from the whole image up to a very precise 959 00:57:50,455 --> 00:57:51,860 segmentation of the objects. 960 00:57:51,860 --> 00:57:56,000 It can be seen that, indeed, even 961 00:57:56,000 --> 00:57:59,490 if in principle these convolutional networks are 962 00:57:59,490 --> 00:58:03,230 trained to classify objects in the whole image, 963 00:58:03,230 --> 00:58:05,560 as is the case in the ImageNet dataset, 964 00:58:05,560 --> 00:58:08,370 it is still true that in our case 965 00:58:08,370 --> 00:58:12,200 we observed that there is still a large benefit from having 966 00:58:12,200 --> 00:58:15,950 a fine-grained segmentation. 967 00:58:15,950 --> 00:58:20,170 So probably the network is not able to completely discard 968 00:58:20,170 --> 00:58:23,220 the non-relevant information that is in the background. 969 00:58:23,220 --> 00:58:25,550 So this is a possible interesting direction 970 00:58:25,550 --> 00:58:27,680 of research. 971 00:58:27,680 --> 00:58:34,460 And finally, the last point that I decided to tell you about-- 972 00:58:34,460 --> 00:58:36,110 I will skip the incremental part, 973 00:58:36,110 --> 00:58:40,670 because it's ongoing work that I'm doing with Raffaello-- 974 00:58:40,670 --> 00:58:44,090 is about the exploitation of temporal contextual 975 00:58:44,090 --> 00:58:46,850 information. 976 00:58:46,850 --> 00:58:50,930 Here, you can see the same kind of plot 977 00:58:50,930 --> 00:58:53,640 that I showed you before. 978 00:58:53,640 --> 00:58:56,270 So the task is object identification, 979 00:58:56,270 --> 00:58:58,330 with an increasing number of objects. 980 00:58:58,330 --> 00:59:02,660 And the dotted black line represents the accuracy 981 00:59:02,660 --> 00:59:06,520 that you obtain if you consider, as you 982 00:59:06,520 --> 00:59:09,800 were asking before, the classification of each 983 00:59:09,800 --> 00:59:12,320 frame independently. 984 00:59:12,320 --> 00:59:14,870 So you can see that in this case the accuracy that you get 985 00:59:14,870 --> 00:59:18,620 is pretty low, considering that we have to discriminate 986 00:59:18,620 --> 00:59:21,390 between only 28 objects. 987 00:59:21,390 --> 00:59:24,860 However, it is also true that as soon 988 00:59:24,860 --> 00:59:34,100 as you start considering, instead of the prediction obtained looking 989 00:59:34,100 --> 00:59:39,320 only at the current frame, the most frequent prediction 990 00:59:39,320 --> 00:59:43,370 that occurred in a temporal window-- so in the previous, 991 00:59:43,370 --> 00:59:46,190 let's say, 50 frames-- 992 00:59:46,190 --> 00:59:50,150 you can boost your recognition accuracy a lot.
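A minimal sketch of this windowed voting; the 50-frame window follows the talk, while the class-handling details are assumptions for illustration.

```python
from collections import Counter, deque

# Instead of trusting each frame on its own, report the most frequent
# per-frame prediction over a sliding window of recent frames.

class TemporalVoter:
    def __init__(self, window=50):
        self.history = deque(maxlen=window)

    def update(self, frame_prediction):
        """Add the current frame's predicted label and return the majority vote."""
        self.history.append(frame_prediction)
        return Counter(self.history).most_common(1)[0][0]

voter = TemporalVoter(window=50)
# for each incoming frame: stable_label = voter.update(per_frame_label)
```

Even this very simple smoothing exploits the fact that consecutive frames show the same object, which is exactly the kind of contextual information listed among the requirements above.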
993 00:59:50,150 --> 00:59:54,050 As you can see here, from green to red, 994 00:59:54,050 --> 00:59:56,920 increasing the length of the temporal window 995 00:59:56,920 --> 00:59:59,480 increases the recognition accuracy that you get. 996 00:59:59,480 --> 01:00:00,980 This is a very simple approach. 997 01:00:00,980 --> 01:00:05,360 But it shows that the fact 998 01:00:05,360 --> 01:00:08,390 that you are actually dealing with videos instead of images 999 01:00:08,390 --> 01:00:09,590 in the wild is relevant. 1000 01:00:09,590 --> 01:00:13,460 And it is another direction of research. 1001 01:00:13,460 --> 01:00:16,620 So finally, in the last part of my talk, 1002 01:00:16,620 --> 01:00:18,470 I would like to tell you about the work 1003 01:00:18,470 --> 01:00:24,050 that I'm actually doing now, which concerns 1004 01:00:24,050 --> 01:00:28,970 mostly object categorization tasks instead 1005 01:00:28,970 --> 01:00:30,860 of identification. 1006 01:00:30,860 --> 01:00:33,680 And this is the reason why we decided 1007 01:00:33,680 --> 01:00:37,160 to acquire a new dataset, which is 1008 01:00:37,160 --> 01:00:40,190 larger than the previous one, 1009 01:00:40,190 --> 01:00:49,520 because it is composed not only of more categories but, 1010 01:00:49,520 --> 01:00:53,210 in particular, of more instances per category, 1011 01:00:53,210 --> 01:00:57,530 in order to be able to perform categorization experiments, as I 1012 01:00:57,530 --> 01:00:58,670 told you. 1013 01:00:58,670 --> 01:01:02,660 Here, you can see the categories with which we are starting 1014 01:01:02,660 --> 01:01:08,850 are 28, divided into seven macro-categories, let's say. 1015 01:01:08,850 --> 01:01:13,490 But the idea of this dataset is to have a dataset that is continuously 1016 01:01:13,490 --> 01:01:16,310 expandable in time. 1017 01:01:16,310 --> 01:01:18,830 So there is an application that we 1018 01:01:18,830 --> 01:01:20,850 use to acquire this dataset. 1019 01:01:20,850 --> 01:01:23,660 And the idea is to perform periodic acquisitions 1020 01:01:23,660 --> 01:01:29,120 in order to incrementally enrich the knowledge of the robot 1021 01:01:29,120 --> 01:01:31,440 about the objects in the scene. 1022 01:01:34,610 --> 01:01:38,990 Also, another important factor regarding this dataset 1023 01:01:38,990 --> 01:01:42,290 is that, differently from the previous one, 1024 01:01:42,290 --> 01:01:46,640 it will be divided and tagged by nuisance factors. 1025 01:01:46,640 --> 01:01:50,030 And in particular, for each object, 1026 01:01:50,030 --> 01:01:53,150 we are acquiring different videos 1027 01:01:53,150 --> 01:01:57,320 where we isolate the different transformations 1028 01:01:57,320 --> 01:01:59,390 that the object is undergoing. 1029 01:01:59,390 --> 01:02:02,350 So we have a video where the object is just shown 1030 01:02:02,350 --> 01:02:03,720 at different scales. 1031 01:02:03,720 --> 01:02:05,870 Then it is rotating in the plane, then outside 1032 01:02:05,870 --> 01:02:08,709 the plane, then it is translating. 1033 01:02:08,709 --> 01:02:10,250 And then there is a final video where 1034 01:02:10,250 --> 01:02:14,490 all of these transformations occur simultaneously. 1035 01:02:14,490 --> 01:02:16,490 And finally, to acquire this dataset we 1036 01:02:16,490 --> 01:02:24,290 decided to also use depth information, so in the end 1037 01:02:24,290 --> 01:02:26,720 we acquired both the left and the right cameras.
1038 01:02:26,720 --> 01:02:29,390 And in principle, this information 1039 01:02:29,390 --> 01:02:34,160 could be used to obtain the 3D structure of the objects. 1040 01:02:34,160 --> 01:02:40,850 And this is the idea that we used in order 1041 01:02:40,850 --> 01:02:44,570 to make the robot focus on the object of interest 1042 01:02:44,570 --> 01:02:47,080 using disparity. 1043 01:02:47,080 --> 01:02:48,890 Disparity is very useful in this case, 1044 01:02:48,890 --> 01:02:57,410 because it allows the robot to detect unknown objects given just 1045 01:02:57,410 --> 01:03:02,720 the fact that we know that we want the robot focused 1046 01:03:02,720 --> 01:03:05,690 on the closest object in the scene. 1047 01:03:05,690 --> 01:03:09,470 So it is a very powerful method in order 1048 01:03:09,470 --> 01:03:11,240 to have the robot track an unknown 1049 01:03:11,240 --> 01:03:16,720 object under all different lighting conditions and so on. 1050 01:03:16,720 --> 01:03:17,220 Yeah. 1051 01:03:17,220 --> 01:03:22,070 And here, you can see this is the left camera. 1052 01:03:22,070 --> 01:03:24,230 This is the disparity map. 1053 01:03:24,230 --> 01:03:27,680 This is its segmentation, which provides an approximate region 1054 01:03:27,680 --> 01:03:29,540 of interest around the object. 1055 01:03:29,540 --> 01:03:31,010 And this is the final output. 1056 01:03:34,010 --> 01:03:36,940 So I started acquiring the first-- 1057 01:03:36,940 --> 01:03:41,330 well, they should be in red, but it's not very clear, I mean. 1058 01:03:41,330 --> 01:03:43,970 I started acquiring the first categories 1059 01:03:43,970 --> 01:03:48,470 among these 21 listed here, which 1060 01:03:48,470 --> 01:03:53,570 are the squeezer, the sprayer, the cream, the oven 1061 01:03:53,570 --> 01:03:55,150 glove, and the bottle. 1062 01:03:55,150 --> 01:03:59,120 For each row, you see the ten instances that I collected. 1063 01:03:59,120 --> 01:04:01,400 And the idea is to continue acquiring them 1064 01:04:01,400 --> 01:04:03,860 when I go back to Genoa. 1065 01:04:03,860 --> 01:04:09,720 And so here, you can see an example of the five videos. 1066 01:04:09,720 --> 01:04:12,100 Actually, I acquired 10 videos per object, 1067 01:04:12,100 --> 01:04:16,400 five for the training set and five for the test set. 1068 01:04:16,400 --> 01:04:19,710 And you can see that in the different videos 1069 01:04:19,710 --> 01:04:23,030 the object is undergoing different transformations. 1070 01:04:23,030 --> 01:04:28,090 And this is the final one, where these transformations are mixed. 1071 01:04:28,090 --> 01:04:32,210 Oh, the images here are not segmented yet. 1072 01:04:32,210 --> 01:04:33,920 So you can see the whole image. 1073 01:04:33,920 --> 01:04:38,000 But in the end, the information about the segmentation 1074 01:04:38,000 --> 01:04:41,280 and disparity and so on will also be available. 1075 01:04:41,280 --> 01:04:44,390 And this dataset of the 50 objects 1076 01:04:44,390 --> 01:04:47,780 that I acquired, together with the application 1077 01:04:47,780 --> 01:04:50,000 that I'm using to acquire the dataset, 1078 01:04:50,000 --> 01:04:52,370 are available if you are willing to use them 1079 01:04:52,370 --> 01:04:55,730 for the projects of Thrust 5, for example, in order 1080 01:04:55,730 --> 01:04:58,190 to investigate the invariance properties 1081 01:04:58,190 --> 01:05:01,170 of the different representations. 1082 01:05:01,170 --> 01:05:05,220 And so that's it.
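For reference, a hedged OpenCV sketch of the disparity-based attention step described above; this is not the actual iCub stereo-vision module, and the parameters and file names are invented for illustration. The idea is simply that the closest surface has the largest disparity, so thresholding near the maximum isolates the held object.

```python
import cv2
import numpy as np

# Compute a disparity map from the two cameras, keep the nearest region, and
# crop a bounding box around it for the recognition pipeline.

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point values

mask = (disparity > 0.8 * disparity.max()).astype(np.uint8) * 255  # keep the closest surface
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((15, 15), np.uint8))

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    region_of_interest = left[y:y + h, x:x + w]   # approximate crop around the object
```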