The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CARLO CILIBERTO: Good morning. Today we will give a bit of an overview of the iCub robot. It will take about an hour, an hour and a half, and we have organized the schedule as a series of four short talks followed by a live demo on the iCub. I will give you an overview of the fields and capabilities that the iCub has developed so far, while Alessandro, Raffaello, and Giulia will show you part of what is going on right now on the robot; they are going to talk about their recent work.

So let's start with the presentation. This is the iCub, a child-sized humanoid robot, about this size. The iCub project began in 2004, and the iCub, or rather the iCubs, because there are many of them, were built in Genoa at the Italian Institute of Technology. The main motivation behind the creation and design of this platform was to have a way to study how intelligence and cognition emerge in artificial embodied systems. Giulio Sandini and Giorgio Metta, whom you can see there, are the original founders of the iCub world, and they are both directors at IIT.

This is a bit of a timeline that I drew; actually, many other things have been going on during these 11 years. This is a video celebrating the first 10 years of the project, and in it you can see many more things that the iCub is able to do. I selected this part because I think it can be useful, if you are interested in doing projects with the robot, to have an idea of the kinds of skills and feedback the robot can provide you for an experiment.
As I told you, the iCub was built with the idea of creating an artificial embodied system that could explore the environment and learn from it, so it has many different sensors; these are some of them. In the head we have an accelerometer and a gyroscope that provide inertial feedback to the system, and two Dragonfly cameras that provide medium-resolution images. As you can see there, the robot is about one meter tall, a bit more, it is fairly light, 55 kilograms, and it has a lot of degrees of freedom: 53 of them, which allow it to perform many complicated actions. It is equipped with force and torque sensors, which I will go over in a minute. Its whole body, or at least the covered part of the robot, the black parts you can see there, is covered in artificial skin, which provides feedback about contact with the external world. It also has microphones mounted on the head, but for sound and speech recognition it is currently better to use a direct microphone, because of noise-cancellation problems and so on; if you are interested in speech or sound feedback, we use other kinds of microphones.

During these 11 years the iCub has been involved in many, many projects, and indeed part of what I'm going to show you is the result of the joint effort of many labs, mainly in Europe, since these are mostly European projects. The iCub group is also an international partner of the CBMM project.

Regarding the force/torque sensors, they are the sensors you can see there. They measure forces and torques, they are mounted in each of the limbs of the robot and in the torso, and they allow the robot to sense its interaction with the world. Indeed, with this kind of feedback it can do many different things.
For instance, in this video I'm showing an example of how the feedback provided by the force/torque sensors can be used to guide the robot and teach it different kinds of actions, in this case a pouring action, which it can then repeat and perhaps generalize.

Force/torque sensors give the robot feedback about the intensity of its interaction with the external world, but they do not let it know where this interaction is occurring. For that, we have the artificial skin covering the robot, as I told you. The technology used for it, which you can see here for the palm of the robot's hand, is capacitive, similar to the technology used in smartphone touchscreens. The yellow dots you can see are electrodes that, together with a second conductive layer, form capacitors. When there is an interaction with the environment, the change in capacitance provides feedback about the intensity of the contact as well as its location: it tells the robot where the interaction is occurring.

The artificial skin is really useful for an embodied agent, for reasons like the one you see in this video. Without this feedback, if you have a very light object, the robot is not able to detect that it is interacting with something: it just keeps closing the hand, gets no feedback, and crushes the object. By using the sensors on the fingertips, the robot can detect that it is actually touching something and stop the action without crushing the object.

Here are other useful things that can be done with the artificial skin. This is an example of combining the information from the force/torque sensors and the skin: the skin detects where the force is applied to the robot and in which direction. And in that case, the robot is counterbalancing.
It is negating the effect of gravity and of its own internal forces, so the arm basically floats around as if it were in space with no friction. By touching the arm of the robot, we make it drift in the direction opposite to where the force, or torque, is applied. You can see that the arm is turning, but, as you can see, it behaves as if it were in space without any friction, because the controller is cancelling both gravity and the internal forces.

Finally, still about the artificial skin, there is some work by Alessandro. He is going to talk about something else, but I find it particularly interesting to show. This is an example of the robot self-calibrating the model of its own body with respect to its own hand. The idea is to have the robot use the tactile feedback from the fingertip and from the skin of its forearm, for instance, to learn the spatial relation between the fingertip and the arm. First, by touching itself, it learns the correspondence; then it shows that it has learned this correlation by reaching for the same point when someone else touches it. This can be seen as a way of self-calibrating without the need for a kinematic model of the robot: the robot simply explores itself and learns how the different parts of its body relate to one another.

Again related to self-calibration, but this time a calibration between vision and motor activity, is a work that appeared in 2014 in which the actions the robot is able to perform are calibrated with respect to its ability to perceive the world. In the video I'm going to show, the robot tries to reach for an object and fails, because the model of the world that it uses to perform the reaching is not aligned with the 3-D model provided by vision.
This can happen due to small errors in the kinematics or in the vision, and even small errors can cause a complete failure of the system. So the robot tries to correlate the two sources of information: in this case it looks at its own fingertip to see where it actually is in the image, and compares that with where the kinematic model predicts the hand should be. The green dot is the point where the kinematics predicts the fingertip to be, and you also see where the fingertip actually is. By learning this relation, the robot is able to cope with the misalignment, and after a first calibration phase it can perform the reaching action successfully, as you will see in a moment. This kind of calibration ability would also be quite useful in situations where the robot is damaged and its model therefore changes completely. As you can see, now it reaches and performs the grasp correctly.

Finally, before moving on to the other talks, I'm going to show a last video, about balancing. Some of you have asked whether the robot walks. It is currently not able to walk, but this is a video from the group that is in charge of making the robot walk, and the first step, of course, is balancing. This is an example of it: one-foot balancing, where multiple components of what I have shown you about the iCub so far, torque sensing and inertial sensing, are combined to have the robot stand on one foot and also cope with external forces that could make it fall. They are applying forces to the robot, and it is able to detect them and to cope with them while staying stable.

OK, so this was just a brief overview of some of the things that can be done with the iCub. The next talks will say a bit more about what is going on with it right now.
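Carlo does not detail the implementation of that visuo-motor calibration, but the basic idea, learning a mapping between where the kinematic model predicts the fingertip and where vision actually observes it and then correcting reaching targets through that mapping, can be sketched in a few lines. Everything below, the data, the affine form of the correction, and the shared reference frame, is an illustrative assumption rather than the method actually used on the iCub.

```python
import numpy as np

# Hypothetical calibration data: for several test poses we record where the
# kinematic model predicts the fingertip to be and where vision actually sees
# it (both expressed in the same 3-D reference frame). These arrays are
# placeholders, not robot measurements.
p_kin = np.array([[0.30, 0.10, 0.05],
                  [0.28, 0.05, 0.10],
                  [0.35, 0.12, 0.02],
                  [0.32, 0.00, 0.08],
                  [0.29, 0.08, 0.12]])
p_vis = p_kin + np.array([0.02, -0.01, 0.015])  # stand-in for real observations

# Fit an affine correction  p_vis ~ A @ p_kin + b  by ordinary least squares.
X = np.hstack([p_kin, np.ones((len(p_kin), 1))])   # append a constant column
W, *_ = np.linalg.lstsq(X, p_vis, rcond=None)      # W stacks A^T and b (4 x 3)

def correct(p):
    """Map a kinematic prediction into the visual frame with the learned correction."""
    return np.append(p, 1.0) @ W

# A reaching target computed from the kinematic model can then be adjusted
# before the motion is executed, compensating the systematic misalignment.
print(correct(np.array([0.31, 0.06, 0.07])))
```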
ALESSANDRO RONCONE: I want to talk to you about part of my PhD project, which was about tackling the perception problem through the use of multisensory integration. Specifically, I narrowed this big problem down by implementing a model of peripersonal space on the iCub, which is a biologically inspired approach. Peripersonal space is a concept that has been known in neuroscience and psychology for years, so let me start with what peripersonal space is and why it is so important for humans and animals. It is defined as the space around us within which objects can be grasped and manipulated. It is basically an interface between our body and the external world, and for this reason it benefits from a multimodal, integrated representation that merges information from different modalities; historically these have been the visual system, the tactile system, proprioception, the auditory system, and even the motor system.

Historically, it has been studied by two different fields: neurophysiology on one side, and psychology and developmental psychology on the other. They followed two different approaches, the former bottom-up, the latter top-down, and they came out with different outcomes. The former emphasized the role of perception and its interplay with the motor system in the control of movement, whereas the latter focused mainly on the multisensory aspect, that is, how different modalities are combined in order to form a coherent view of the body and the nearby space. Luckily, in recent years they have converged to a common ground and a shared interpretation, and for the purposes of my work I would like to highlight the main aspects. Firstly, and this might be of interest from an engineering perspective, peripersonal space is made of different reference frames that are located in different regions of the brain.
And there might be a way for the brain to switch from one to another according to the context and the goal. Secondly, as I was saying, peripersonal space benefits from multisensory integration in order to form a coherent view of the body and the surrounding space. In this experiment by Fogassi in 1996, they found a number of so-called visuo-tactile neurons, that is, neurons that fire both when a specific skin region is stimulated and when an object is presented in the surrounding space. This means that these neurons code both visual and tactile information, but they also carry some proprioceptive information, because they are basically attached to the body part they belong to. Lastly, one of the main properties of this representation is its plasticity. For example, in this experiment by Iriki and colleagues, the extension of these receptive fields into the visual space, the surrounding space, was shown, after training with a rake, to grow until it enclosed the tool, as if the tool had become part of the body. So through experience and tool use, the monkey was able to grow this receptive field.

These are very nice properties, and we would like them to be available to the robot. In robotics, the work related to peripersonal space can be divided into two groups. On one side there are the models and simulations; the closest one to my work is the one from Fuke and colleagues from [INAUDIBLE] lab, in which they used a simulated robot to model the mechanisms that lead to this visuo-tactile representation. On the other side there are the engineering approaches, which are few. The closest one is this work by Mittendorfer from Gordon Cheng's lab, in which they first developed a multimodal skin, that is, the hardware needed to do this, and then used it to trigger local avoidance responses, reflexes to incoming objects.
We are trying to position ourselves in the middle. Let's say we are not trying to create a perfect model of peripersonal space from a biological perspective, but on the other hand we would like to have something that actually works and is useful for our purposes. From now on, I will divide the presentation into two parts. The first will be about the model, what we think is useful for tackling the problem; in the second I will show you an application of this model, which basically uses this local representation to trigger avoidance or reaching responses distributed throughout the body.

So let me start with the proposed model of peripersonal space. Loosely inspired by the neurophysiological findings we discussed before, we developed this peripersonal space representation by means of spatial receptive fields that extend out from the robot's skin, basically extending the tactile domain into the nearby space. Each taxel, that is, each of the tactile elements the iCub skin is composed of, experiences a set of multisensory events: you let the robot learn these visuo-tactile associations by taking an object and making contact with a skin part. Through tactile experience, each taxel learns a sort of probability of being touched, which activates prior to contact when a new incoming object is presented. We created a cone-shaped receptive field extending from each of the taxels, and for any object entering this receptive field we keep what we call a buffer of its path, so the idea is that the robot retains some information about what was going on before the touch, the actual contact. If the object eventually ends up touching the taxel, the event is labeled as positive, and it reinforces the probability of objects approaching in that way ending up touching the taxel. If not, for example, it might be that the object enters this receptive field and, in the end, ends up touching another taxel.
This will be labeled as a negative event. So in the end, each taxel has a set of positive and negative events it can learn from. The space is three dimensional, because the distance to the object is three dimensional, but we narrowed it down to a one-dimensional domain by taking the norm of the distance, while also accounting for the relative position of the object and the taxel, in order to cope with calibration errors that amounted to up to a couple of centimeters, which is significant.

This one-dimensional variable is discretized into a set of bins, and for each bin we compute the probability of an event falling in that bin ending in touch. The intuitive idea is that at 20 centimeters the probability of being touched is lower than at zero centimeters. On top of this one-dimensional representation we used a Parzen window interpolation technique, which provides a smooth function that, in the end, gives an activation value that depends on the distance of the object. So as soon as a new object enters the receptive field, the taxel fires before being contacted.

We did basically two sets of experiments. Initially we ran a simulation in MATLAB in order to assess the convergence of the learning in the long term, the one-shot learning behavior, and whether our model was able to cope with noise and with the calibration errors I mentioned. Then we went on the real robot, presented it with different objects, and basically touched the robot a hundred times in order to make it learn these representations. Trust me, I don't want to bother you with these technicalities, but we did a lot of work; this is, basically, the math behind the results.

So let me move on to the second part, in which the main problem was for the robot to detect the object visually. To do that, we developed a 3D tracking algorithm that was able to track a [INAUDIBLE] object.
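Before turning to that visual pipeline, the per-taxel learning scheme just described can be summarized in a minimal sketch. The bin layout, the Gaussian smoothing standing in for the Parzen window interpolation, and the class interface below are illustrative assumptions, not the actual iCub implementation.

```python
import numpy as np

class TaxelModel:
    """Illustrative per-taxel model: probability of eventual contact as a
    function of object distance, learned from positive/negative approach events."""

    def __init__(self, max_dist=0.2, n_bins=8, sigma=0.02):
        self.edges = np.linspace(0.0, max_dist, n_bins + 1)
        self.centers = 0.5 * (self.edges[:-1] + self.edges[1:])
        self.pos = np.zeros(n_bins)   # approaches that ended in contact with this taxel
        self.neg = np.zeros(n_bins)   # approaches that did not
        self.sigma = sigma            # smoothing width (assumption)

    def update(self, distances, touched):
        """distances: object distances recorded while it was inside the receptive
        field (the 'buffer'); touched: whether this taxel was eventually hit."""
        idx = np.clip(np.digitize(distances, self.edges) - 1, 0, len(self.centers) - 1)
        counts = np.bincount(idx, minlength=len(self.centers))
        if touched:
            self.pos += counts
        else:
            self.neg += counts

    def activation(self, d):
        """Smooth, Parzen-window style interpolation of the per-bin contact
        probabilities, evaluated at distance d."""
        total = self.pos + self.neg
        p = np.where(total > 0, self.pos / np.maximum(total, 1), 0.0)
        w = np.exp(-0.5 * ((d - self.centers) / self.sigma) ** 2)
        return float(np.sum(w * p) / np.sum(w))

taxel = TaxelModel()
taxel.update(np.array([0.15, 0.10, 0.05, 0.01]), touched=True)   # one positive event
taxel.update(np.array([0.18, 0.16, 0.14]), touched=False)        # one negative event
print(taxel.activation(0.03))   # fires before contact for close objects
```

Each taxel would maintain one such model, updated from the buffered approach trajectories and queried to produce the pre-contact activations shown in the demo.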
To build this tracker, we used software that was already available in the iCub software repository, which provides some basic algorithms you can play with. Namely, we used a two-dimensional optical flow module made by Carlo, a 2D particle filter, and a 3D stereo vision algorithm, basically the same one shown before during the recognition demo. All of this was feeding a Kalman filter that provides a robust estimate of the position of the object. The idea is that the motion detected by the optical flow acts as a trigger for the subsequent pipeline: after consistent enough motion is seen by the optical flow module, a template is extracted and tracked in the 2D visual field by the particle filter; this information is then sent to the 3D depth map, and that in turn feeds the Kalman filter, which gives us a stable representation, because obviously the stereo system does not work that well in our context.

And this, if it works... no. OK, on my laptop it works. OK, now it works. This is the idea: I was basically waving, moving the object in the beginning, and then, when it is detected, the pipeline starts. You can see the tracking here, this is the stereo vision, and this is the final outcome. This was used for the learning: we did a lot of repetitions with these objects approaching the skin on different body parts. This is the graph; I don't want to talk about that, so let me start the video. This is basically the skin, and this is the part that was trained beforehand. When there is a contact, there is an activation here; you can see the activation. And soon after, and this worked even with a single example, the taxel starts firing before the contact. Obviously, this improves over time.
And it depends on the body part that is touched. For example, if I touch here, coming from the top, the representation starts firing mainly here, and this obviously depends on the specific body part. Now I think I'm going to touch the hand, so after a while you will see an activation on the hand; obviously there is also some activation on the forearm, because I was getting closer to the forearm as well.

As for applications: by itself, this is simply a representation, so it is not directly usable. We exploited it in order to develop an avoidance behavior, a margin of safety around the body. Let's say that if a taxel is firing, I would like that body part to move away from the object, assuming it could be potentially harmful. Conversely, I would also like the robot to be able to reach for the object under consideration with any body part. To this end, we developed an avoidance and catching controller that leverages this distributed information and performs a sensor-based guidance of the motor actions by means of these visuo-tactile associations.

This is basically how it works. This is the testing stage, so the representation has already been learned. As soon as I get closer, the taxels start firing, because of the probabilities that were learned, and the arm goes away. Obviously, the movement depends on the specific skin part being approached: if I'm approaching here, the arm will move away from here; if I'm coming from the top, and I think this one was from the top, yes, the arm will move away toward the back, away in another direction. The idea here is not to tackle the problem with a classical robotics approach; rather, this behavior emerges from the learning. And the idea was very simple: we looked at the taxels that were firing, and if they were firing enough, we recorded their positions.
And we did, basically, a population coding, that is, a weighted average according to the activation, the predicted probability of contact. We did that both for the positions of the taxels and for their normals, so in the end, if you have a bunch of taxels firing here, you end up with a single point to move away from. The catching, the reaching, is basically the same but in the opposite direction: if I want to avoid, I go this way; if I want to catch, I go that way. Obviously, if you do it with the hand, this is standard robotic reaching, but it can also be triggered on different body parts. As you can see here, I first get a virtual activation and then the physical contact. And, basically, our design uses the same controller for both behaviors. OK, these are some technicalities I don't want to show you.

In conclusion, the work presented here is, to our knowledge, the first attempt at creating a decentralized, multisensory, visuo-tactile representation of a robot's body and its nearby space by means of the distributed skin and of interaction with the environment. One of the assets of our representation is that learning is fast; as you saw, it can learn even from one single example. It is parallel across the whole body, in the sense that every taxel learns its own representation independently. It is incremental, in the sense that it converges toward a stable representation over time. And, importantly, it adapts from experience, so it can automatically compensate for errors in the model, which, for humanoid robots, is one of the main problems when merging different modalities. OK, thank you. If you have any questions, feel free to ask.

RAFFAELLO CAMORIANO: I am Raffaello.
Today I'll talk to you about a little bit of my work on machine learning and robotics, in particular about some subfields of machine learning, namely large scale learning and incremental learning. But what do we expect from a modern robot, and how can machine learning help with this? Well, we expect modern robots to work in unstructured environments which they have never seen before, and to learn new tasks on the fly, depending on the particular needs arising throughout the operation of the robot itself, and across different modalities: vision, of course, but also tactile sensing, which is available on the iCub, and proprioceptive sensing, including force sensing, [INAUDIBLE] and so on. And we want to do all of this over a potentially very long time span, because we expect robots to be companions of humans in the real world, operating for maybe years or more. This poses a lot of challenges, especially from the computational point of view, and machine learning can actually help in tackling them.

For instance, there are large scale learning methods, algorithms which can work with very large datasets. If we have millions of points gathered by the robot's cameras over ten days and we want to process them, using standard machine learning methods would make that a very difficult problem to solve unless we use, for instance, randomized methods and so on. Machine learning also has incremental algorithms, which allow the learned model to be updated as new, previously unseen data are presented to the agent. And there is also the subfield of transfer learning, which allows knowledge learned for a particular task to be reused for another related task without the need to see many new examples of the new task.

So my main research focus is machine learning.
I work especially on large scale learning methods, incremental learning, and the design of algorithms which allow for trade-offs between computation and accuracy; I will explain this a bit more later. As concerns robotic applications, I work with Giulia, Carlo, and others on incremental object recognition, a setting in which the robot is presented with new objects over a long time span and has to learn them on the fly. I'm also working on a system identification problem, which I will explain later, related to the motion of the robot.

This is one of the works which has occupied my last year, and it is related to large scale learning. If we consider that we may have a very large n, the number of examples we have access to, then in the setting of kernel methods we may have to store a huge matrix, the kernel matrix K, which is n by n and could simply be impossible to store. There are randomized methods, like the Nystrom method, which make it possible to compute a low-rank approximation of the kernel matrix simply by drawing a small number m of samples at random and building the matrix K_nm, which is much smaller, because m is much smaller than n. This is a well-known method in machine learning, but we tried to see it from a different point of view than usual. Usually it is seen just from a computational standpoint, as a way to fit a difficult problem into computers with limited capabilities, while we proposed to see the Nystrom approximation as a regularization operation in itself. The usual way in which the Nystrom method is applied, for instance with kernel regularized least squares, is that the parameter m, the number of examples taken at random, is chosen as large as possible, just so that the problem still fits in the memory of the available machines; and then, after choosing a large m, it is often necessary to add explicit regularization on top of it anyway, for instance the usual Tikhonov penalty.
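To make the two knobs concrete, a minimal sketch of subsampled kernel regularized least squares might look like the following; the Gaussian kernel, the synthetic data, and the small numerical jitter are illustrative assumptions, not the code used in the actual work.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_krls(X, y, m, lam, sigma=1.0, rng=np.random.default_rng(0)):
    """Subsampled (Nystrom) kernel regularized least squares: only an n x m
    kernel block is ever formed, so m controls the computational cost."""
    n = X.shape[0]
    centers = X[rng.choice(n, size=m, replace=False)]
    Knm = gaussian_kernel(X, centers, sigma)            # n x m
    Kmm = gaussian_kernel(centers, centers, sigma)      # m x m
    # Solve (Knm^T Knm + n*lam*Kmm) alpha = Knm^T y
    A = Knm.T @ Knm + n * lam * Kmm
    alpha = np.linalg.solve(A + 1e-10 * np.eye(m), Knm.T @ y)
    return centers, alpha

def predict(Xtest, centers, alpha, sigma=1.0):
    return gaussian_kernel(Xtest, centers, sigma) @ alpha

# Toy usage with synthetic data (placeholders, not robot data).
X = np.random.default_rng(1).uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(2).standard_normal(2000)
centers, alpha = nystrom_krls(X, y, m=50, lam=1e-6)
print(predict(np.array([[0.5]]), centers, alpha))
```

In this standard form both m and the explicit penalty lam appear; the observation described next is that shrinking m already acts as a regularizer, so lam can effectively be dropped and m tuned instead.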
And this sounds a bit like a waste of time and memory, because what regularization, roughly speaking, does is discard the irrelevant eigencomponents of the kernel matrix. We observed that we can obtain the same effect by simply taking fewer random examples, so having a smaller model which can be computed more efficiently and without having to regularize again afterwards. So m, the number of examples used, controls both the regularization and the computational complexity of our algorithm, which is very useful in a robotic setting where we have to deal with lots of data.

As regards the incremental object recognition task, this is another project I'm working on. Imagine that the robot has to work in an unknown environment and is presented with novel objects on the fly; it has to update its object recognition model in an efficient way, without retraining from scratch every time a new object arrives. This can be done fairly easily with a slight modification of the regularized least squares algorithm and a proper reweighting. An open question is how to change the regularization as n grows, because we have not yet found a way to efficiently update the regularization parameter in this case, so we are still working on it.

The last project I'll talk about is more related to physics and motion. We take an arbitrary limb of the robot, for instance the arm, and our task is to learn an inverse dynamics model, that is, a model that can predict the internal forces and torques of the arm during motion. This is useful, for instance, in a contact detection setting: when the sensor readings differ from the predicted ones, that may mean there is a contact. It is also useful for external force estimation or, for example, for identifying the mass of a manipulated object. We had some challenges in this project.
We had to devise a model that is interpretable, in which the rigid body dynamics parameters remain understandable and intelligible for control purposes. We wanted the model to be more accurate than a standard rigid body dynamics model, and we also wanted it to adapt to conditions that change over time: for instance, during the operation of the robot, after one hour the change in temperature also changes the mechanical and dynamic properties of the arm, and we want to accommodate this in an incremental way.

So this is what we did. We implemented a semi-parametric model in which the first part, which acts as a prior, is a simple incremental parametric model, and then we used random features to build a non-parametric incremental model which can be updated efficiently. We showed in real experiments that the semi-parametric model works as well as the non-parametric one but converges faster, because it has initial knowledge about the physics of the arm, and that it is also better than the fully parametric one, because it also models, for example, dynamical effects due to the flexibility of the body, which are usually not captured by rigid body dynamics models.

Another thing I'm doing is maintaining the Grand Unified Regularized Least Squares (GURLS) library, which is, of course, a library for regularized least squares. It supports large scale datasets. It was developed in a joint exchange between MIT and IIT some years ago, by others, not by me, and it has a MATLAB and a C++ interface. If you want to have a look at how these methods work, I suggest you try out the tutorials available on GitHub.

GIULIA PASQUALE: I'm Giulia, and I work on the iCub robot with my colleagues, especially on vision and, in particular, on visual recognition.
I work under the supervision of Lorenzo Natale and Lorenzo Rosasco; both will be here for a few days in the following weeks. The work I'm going to present has been done in collaboration with Carlo and also with Francesca Odone from the University of Genoa.

In the last couple of years, computer vision methods based on deep convolutional neural networks have achieved remarkable performance in tasks such as large scale image classification and retrieval. The extreme success of these methods is mainly due to the increasing availability of ever larger datasets, and in particular I'm referring to ImageNet, which is composed of millions of examples labeled into thousands of categories through crowdsourcing platforms such as Amazon Mechanical Turk. The increased data availability, together with the increased computational power, has made it possible to train deep networks with millions of parameters in a supervised way, from the image up to the final label, through the backpropagation algorithm. This marked a breakthrough, in particular in 2012, when Alex Krizhevsky proposed for the first time a network of this kind trained on the ImageNet dataset and decisively won the ImageNet Large Scale Visual Recognition Challenge with it. The trend has been confirmed in the following years, so that nowadays problems such as large scale image classification or detection are usually tackled with this deep learning approach. And not only that: it has also been demonstrated, at least empirically-- oh, I'm sorry, maybe this is not particularly clear, but this is the Krizhevsky network-- that models of this kind, trained on large datasets such as ImageNet, also provide very good, general, and powerful image descriptors that can be applied to other tasks and datasets.
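Giulia expands on this off-the-shelf use of a pretrained network next; as a rough sketch of the idea, one can load an ImageNet-pretrained model and read out an intermediate layer as an image descriptor. The snippet below uses torchvision's AlexNet purely as a modern stand-in for the Krizhevsky model actually used in this work, and the image file name is a hypothetical crop saved from the robot's camera.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# An AlexNet pretrained on ImageNet, used here as a black-box feature extractor.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()

# Keep everything up to the second-to-last fully connected layer: its output,
# a 4096-dimensional vector, is the kind of intermediate-layer descriptor
# mentioned above.
feature_extractor = torch.nn.Sequential(
    net.features, net.avgpool, torch.nn.Flatten(), *list(net.classifier[:-1]))

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

img = Image.open("crop.jpg").convert("RGB")   # hypothetical crop from the robot
with torch.no_grad():
    feat = feature_extractor(preprocess(img).unsqueeze(0))
print(feat.shape)                              # torch.Size([1, 4096])
```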
770 00:45:51,500 --> 00:45:54,110 In particular, it is possible to use 771 00:45:54,110 --> 00:46:00,000 a convolutional neural network trained on the ImageNet dataset, 772 00:46:00,000 --> 00:46:04,190 feed it with images, and use it 773 00:46:04,190 --> 00:46:08,660 as a black box, extracting the vectorial representation 774 00:46:08,660 --> 00:46:11,810 of the incoming images as the output of one 775 00:46:11,810 --> 00:46:14,600 of the intermediate layers. 776 00:46:14,600 --> 00:46:18,640 Or even better, it is possible to start from a network 777 00:46:18,640 --> 00:46:21,410 model trained on the ImageNet dataset 778 00:46:21,410 --> 00:46:27,290 and fine-tune its parameters on a new dataset for a new task, 779 00:46:27,290 --> 00:46:31,940 achieving and even surpassing the state of the art-- 780 00:46:31,940 --> 00:46:37,400 for example, also on the PASCAL dataset and other tasks-- 781 00:46:37,400 --> 00:46:38,600 following this approach. 782 00:46:41,330 --> 00:46:48,240 So it is natural to ask at this point why, 783 00:46:48,240 --> 00:46:53,630 instead, in robotics, providing robots 784 00:46:53,630 --> 00:46:58,610 with robust and accurate visual recognition capabilities 785 00:46:58,610 --> 00:47:03,110 in the real world is still one of the greatest challenges that 786 00:47:03,110 --> 00:47:06,260 prevents the use of autonomous agents 787 00:47:06,260 --> 00:47:10,030 in concrete applications. 788 00:47:10,030 --> 00:47:14,540 And actually, this is a problem that is not only related 789 00:47:14,540 --> 00:47:20,190 to the iCub platform, but it is also a limiting factor 790 00:47:20,190 --> 00:47:24,470 for the performance of the latest 791 00:47:24,470 --> 00:47:27,050 robotics platforms, such as the ones that 792 00:47:27,050 --> 00:47:30,880 have been participating, for example, in the DARPA Robotics 793 00:47:30,880 --> 00:47:32,980 Challenge. 794 00:47:32,980 --> 00:47:40,700 Indeed, as you can see here, robots 795 00:47:40,700 --> 00:47:48,500 are still either highly tele-operated, or complex 796 00:47:48,500 --> 00:47:50,491 methods-- 797 00:47:50,491 --> 00:47:55,520 to, for example, map the 3D structure of the environment 798 00:47:55,520 --> 00:48:00,080 and label it a priori-- must be implemented in order 799 00:48:00,080 --> 00:48:03,080 to enable autonomous agents to act in very 800 00:48:03,080 --> 00:48:06,650 controlled environments. 801 00:48:06,650 --> 00:48:13,460 So we decided to focus on very simple settings 802 00:48:13,460 --> 00:48:15,750 where, in principle, computer vision 803 00:48:15,750 --> 00:48:20,030 methods such as the ones that I've been describing to you 804 00:48:20,030 --> 00:48:23,480 should at least-- 805 00:48:23,480 --> 00:48:25,550 well, should provide very good performance, 806 00:48:25,550 --> 00:48:28,420 because here the setting is pretty simple. 807 00:48:28,420 --> 00:48:31,310 And we tried to evaluate the performance 808 00:48:31,310 --> 00:48:34,290 of these deep learning methods in these settings. 809 00:48:34,290 --> 00:48:37,260 Here you can see the robot, that one, 810 00:48:37,260 --> 00:48:39,740 standing in front of a table. 811 00:48:39,740 --> 00:48:43,610 There is a human who gives verbal instructions 812 00:48:43,610 --> 00:48:48,270 to the robot and also, for example in this case, 813 00:48:48,270 --> 00:48:54,470 the label of the object to be either learned or recognized.
814 00:48:54,470 --> 00:49:00,830 And the robot can focus its attention on potential objects 815 00:49:00,830 --> 00:49:03,920 through bottom-up segmentation techniques-- for example, 816 00:49:03,920 --> 00:49:08,830 in this case, color or other saliency-based segmentation 817 00:49:08,830 --> 00:49:09,690 methods. 818 00:49:09,690 --> 00:49:12,230 I'm not going into the details of this setting, 819 00:49:12,230 --> 00:49:16,760 because you will see a demo of it after my talk. 820 00:49:16,760 --> 00:49:20,150 Another setting that we are considering 821 00:49:20,150 --> 00:49:22,130 is similar to the previous one. 822 00:49:22,130 --> 00:49:25,800 But this time, there is a human standing in front of the robot. 823 00:49:25,800 --> 00:49:27,380 And there is no table. 824 00:49:27,380 --> 00:49:31,370 And the human is holding the objects in his hands 825 00:49:31,370 --> 00:49:35,150 and is showing one object after the other 826 00:49:35,150 --> 00:49:38,240 to the robot, providing the verbal annotation 827 00:49:38,240 --> 00:49:40,800 for that object. 828 00:49:40,800 --> 00:49:45,260 In this way the robot, for example 829 00:49:45,260 --> 00:49:49,250 here, can exploit motion detection techniques 830 00:49:49,250 --> 00:49:53,330 in order to localize the object in the visual field 831 00:49:53,330 --> 00:49:55,130 and focus on it. 832 00:49:55,130 --> 00:49:58,820 The robot tracks the object continuously, 833 00:49:58,820 --> 00:50:02,270 acquiring in this way crops of the frames 834 00:50:02,270 --> 00:50:06,230 around the object, which are the training examples that 835 00:50:06,230 --> 00:50:10,470 will be used to learn the object's appearance. 836 00:50:10,470 --> 00:50:17,060 So in general, this is the recognition pipeline 837 00:50:17,060 --> 00:50:22,610 that is implemented to perform both of the behaviors 838 00:50:22,610 --> 00:50:24,680 that I've been showing you. 839 00:50:24,680 --> 00:50:27,540 As you can see, the input is the image, 840 00:50:27,540 --> 00:50:31,220 the stream of images from one of the two cameras. 841 00:50:31,220 --> 00:50:34,190 Then there is the verbal supervision of the teacher. 842 00:50:34,190 --> 00:50:38,030 Then there are segmentation techniques 843 00:50:38,030 --> 00:50:41,690 in order to crop a region of interest 844 00:50:41,690 --> 00:50:46,550 from the incoming frame and feed this crop 845 00:50:46,550 --> 00:50:49,220 to a convolutional neural network. 846 00:50:49,220 --> 00:50:54,620 In this case, we are using the famous Krizhevsky model. 847 00:50:54,620 --> 00:50:58,310 Then we encode each incoming crop 848 00:50:58,310 --> 00:51:01,940 in a vector as the output of one of the last 849 00:51:01,940 --> 00:51:03,800 layers of the network. 850 00:51:03,800 --> 00:51:07,485 And we feed all these vectors to a linear classifier, 851 00:51:07,485 --> 00:51:09,860 which is linear because, in principle, the representation 852 00:51:09,860 --> 00:51:14,360 that we are extracting is good enough for the discrimination 853 00:51:14,360 --> 00:51:16,890 that we want to perform. 854 00:51:16,890 --> 00:51:19,250 And so the classifier uses these incoming vectors 855 00:51:19,250 --> 00:51:24,800 either as examples for the training set, 856 00:51:24,800 --> 00:51:33,950 or it assigns to each vector a predicted label. 857 00:51:33,950 --> 00:51:38,330 And the output is a histogram with the probabilities 858 00:51:38,330 --> 00:51:40,370 of all the classes.
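A minimal sketch of this last stage of the pipeline, with scikit-learn's logistic regression standing in for whatever linear model actually runs on the robot; the descriptors, labels, and class count below are placeholders, not data from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear classifier on top of the CNN descriptors, producing the per-class
# probability histogram described above.

train_X = np.random.randn(200, 4096)           # placeholder descriptors of training crops
train_y = np.random.randint(0, 5, size=200)    # placeholder labels given by the teacher

clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)

def classify(descriptor):
    """Return the probability histogram over classes and the winning label."""
    probs = clf.predict_proba(descriptor.reshape(1, -1))[0]
    return probs, int(np.argmax(probs))        # final outcome = class with highest probability
```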
859 00:51:40,370 --> 00:51:45,890 And the final outcome is the one with the highest probability. 860 00:51:45,890 --> 00:51:49,440 And the histogram is updated in real time. 861 00:51:49,440 --> 00:51:52,280 So this pipeline can be used for 862 00:51:52,280 --> 00:51:58,280 either of the two settings that I have described to you. 863 00:51:58,280 --> 00:52:05,330 So in particular, we started by trying 864 00:52:05,330 --> 00:52:13,490 to list some requirements that, according to us, 865 00:52:13,490 --> 00:52:17,510 are fundamental in order to implement 866 00:52:17,510 --> 00:52:22,730 a sort of ideal robotic visual recognition system. 867 00:52:22,730 --> 00:52:28,430 And these requirements are usually not considered 868 00:52:28,430 --> 00:52:31,730 by typical computer vision methods such as the ones that 869 00:52:31,730 --> 00:52:36,860 I have described to you, but they are all the same fundamental 870 00:52:36,860 --> 00:52:39,730 if we want to achieve human-level 871 00:52:39,730 --> 00:52:44,000 performance in the settings that I've been showing you. 872 00:52:44,000 --> 00:52:45,580 For example, first of all, the system 873 00:52:45,580 --> 00:52:47,980 should be, as you have seen, as much 874 00:52:47,980 --> 00:52:51,250 as possible self-supervised, meaning that there must 875 00:52:51,250 --> 00:52:53,530 be techniques in order to focus the robot's 876 00:52:53,530 --> 00:52:55,870 attention on the objects of interest 877 00:52:55,870 --> 00:52:59,140 and isolate them from the visual field. 878 00:52:59,140 --> 00:53:01,600 Then hopefully, we would like to come out 879 00:53:01,600 --> 00:53:04,210 with a system that is reliable and robust 880 00:53:04,210 --> 00:53:06,880 to variations in the environment 881 00:53:06,880 --> 00:53:10,830 and also in the objects' appearance. 882 00:53:10,830 --> 00:53:14,930 Then also, as we are in the real world, 883 00:53:14,930 --> 00:53:18,310 we would like a system able to exploit 884 00:53:18,310 --> 00:53:21,260 the contextual information that is available-- 885 00:53:21,260 --> 00:53:23,170 for example, the fact that we are actually 886 00:53:23,170 --> 00:53:24,190 dealing with videos. 887 00:53:24,190 --> 00:53:27,760 So the frames are temporally correlated. 888 00:53:27,760 --> 00:53:29,740 And we are not dealing with images in the wild, 889 00:53:29,740 --> 00:53:31,990 as in the ImageNet case. 890 00:53:31,990 --> 00:53:34,252 And finally, as Raffaello was mentioning, 891 00:53:34,252 --> 00:53:35,710 we would like to have a system that 892 00:53:35,710 --> 00:53:39,310 is able to learn incrementally, to build ever 893 00:53:39,310 --> 00:53:43,310 richer models of the objects through time. 894 00:53:43,310 --> 00:53:47,380 So we decided to evaluate this recognition pipeline 895 00:53:47,380 --> 00:53:51,060 according to the criteria that I have described to you. 896 00:53:54,280 --> 00:54:00,290 And in order to provide reproducibility for our study, 897 00:54:00,290 --> 00:54:03,370 we decided to acquire a dataset on which 898 00:54:03,370 --> 00:54:06,080 to perform our analysis. 899 00:54:06,080 --> 00:54:09,730 However, we would also like to be confident enough 900 00:54:09,730 --> 00:54:13,960 that the results that we obtain on our benchmark 901 00:54:13,960 --> 00:54:18,400 will also hold in the real usage of our system.
902 00:54:18,400 --> 00:54:20,050 And this is the reason why we decided 903 00:54:20,050 --> 00:54:24,160 to acquire our dataset in the same application setting where 904 00:54:24,160 --> 00:54:27,190 the robot usually operates. 905 00:54:27,190 --> 00:54:29,780 So this is the iCubWorld28 dataset 906 00:54:29,780 --> 00:54:32,780 that I acquired last year. 907 00:54:32,780 --> 00:54:35,620 As you can see, it's composed of 28 objects 908 00:54:35,620 --> 00:54:38,970 divided into seven categories, with four 909 00:54:38,970 --> 00:54:41,120 instances per category. 910 00:54:41,120 --> 00:54:43,860 And I acquired it on four different days 911 00:54:43,860 --> 00:54:46,720 in order to also test incremental learning 912 00:54:46,720 --> 00:54:49,560 capabilities. 913 00:54:49,560 --> 00:54:52,810 The dataset is available on the IIT website. 914 00:54:52,810 --> 00:54:54,740 And you can also use it, for example, 915 00:54:54,740 --> 00:54:58,547 for the projects of Thrust 5 if you are interested. 916 00:55:01,470 --> 00:55:03,820 And this is an example of the kind of videos 917 00:55:03,820 --> 00:55:09,190 that I acquired, considering one of the 28 objects. 918 00:55:09,190 --> 00:55:13,420 There are four videos for training, four for testing, 919 00:55:13,420 --> 00:55:16,960 acquired in four different conditions. 920 00:55:16,960 --> 00:55:21,010 The object is undergoing random transformations, mainly limited 921 00:55:21,010 --> 00:55:23,290 to 3D rotations. 922 00:55:23,290 --> 00:55:26,320 And as you can see, the difference between the days 923 00:55:26,320 --> 00:55:30,130 is mainly limited to the fact that we are just 924 00:55:30,130 --> 00:55:33,460 changing the conditions in the environment-- 925 00:55:33,460 --> 00:55:39,700 for example, the background or the lighting conditions. 926 00:55:39,700 --> 00:55:42,850 And we acquired eight videos for each of the 28 objects 927 00:55:42,850 --> 00:55:45,520 that I showed you. 928 00:55:45,520 --> 00:55:49,840 So first of all, we tried to find a measure, 929 00:55:49,840 --> 00:55:52,960 as I was saying before, to quantify 930 00:55:52,960 --> 00:55:56,200 the confidence with which we can expect 931 00:55:56,200 --> 00:55:59,590 that the results and the performance that we observe 932 00:55:59,590 --> 00:56:02,040 on this benchmark will also hold 933 00:56:02,040 --> 00:56:04,470 in the real usage of the system. 934 00:56:04,470 --> 00:56:08,470 And to do this, first of all, we focused only 935 00:56:08,470 --> 00:56:10,570 on object identification for the moment. 936 00:56:10,570 --> 00:56:14,650 So the task is to discriminate the specific instances 937 00:56:14,650 --> 00:56:19,180 of objects among the pool of 28. 938 00:56:19,180 --> 00:56:28,240 And we decided to estimate, for an increasing number of objects 939 00:56:28,240 --> 00:56:34,630 to be discriminated, from 2 to 28, the empirical probability 940 00:56:34,630 --> 00:56:40,870 distribution of the identification accuracy 941 00:56:40,870 --> 00:56:44,440 that we can observe statistically 942 00:56:44,440 --> 00:56:46,510 for a fixed number of objects. 943 00:56:46,510 --> 00:56:49,150 That is depicted here in the form of box plots. 944 00:56:51,740 --> 00:56:56,300 And also, we estimated, for each fixed number of objects 945 00:56:56,300 --> 00:57:00,800 to be discriminated, the minimum accuracy 946 00:57:00,800 --> 00:57:04,970 that we can expect to achieve with increasing confidence 947 00:57:04,970 --> 00:57:06,920 levels.
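A sketch of how this kind of estimate could be built; the experiment function is a placeholder, and the use of random subsets and percentiles is an assumption about the procedure, not a description of the exact protocol used in the talk.

```python
import numpy as np

# For each pool size k, draw many random subsets of k objects, run an
# identification experiment on each, and summarize the accuracies both as a
# distribution (the box plots) and as a low percentile, i.e. the minimum
# accuracy expected at a given confidence level.

def identification_accuracy(object_subset):
    """Placeholder for a real train/test run restricted to these objects."""
    return np.random.uniform(0.6, 1.0)

rng = np.random.default_rng(0)
n_objects, n_trials = 28, 200
datasheet = {}
for k in range(2, n_objects + 1):
    accs = np.array([identification_accuracy(rng.choice(n_objects, size=k, replace=False))
                     for _ in range(n_trials)])
    datasheet[k] = {
        "median": float(np.median(accs)),
        "min_at_90pct_confidence": float(np.percentile(accs, 10)),
    }
```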
948 00:57:06,920 --> 00:57:08,690 And this is a sort of data sheet. 949 00:57:08,690 --> 00:57:15,080 The idea is to give a hypothetical user 950 00:57:15,080 --> 00:57:18,440 of the robot an idea of the identification accuracy 951 00:57:18,440 --> 00:57:22,940 that can be expected given a certain pool of objects 952 00:57:22,940 --> 00:57:25,700 to be discriminated. 953 00:57:25,700 --> 00:57:28,110 So the second point that I'll briefly describe to you 954 00:57:28,110 --> 00:57:30,410 is the fact that we investigated the effect 955 00:57:30,410 --> 00:57:38,190 of having a more or less precise segmentation of the image. 956 00:57:38,190 --> 00:57:41,960 So we evaluated the task of identifying the 28 957 00:57:41,960 --> 00:57:47,090 objects with different levels of segmentation, 958 00:57:47,090 --> 00:57:50,455 starting from the whole image up to a very precise 959 00:57:50,455 --> 00:57:51,860 segmentation of the objects. 960 00:57:51,860 --> 00:57:56,000 It can be seen that, indeed, even 961 00:57:56,000 --> 00:57:59,490 if in principle these convolutional networks are 962 00:57:59,490 --> 00:58:03,230 trained to classify objects in the whole image, 963 00:58:03,230 --> 00:58:05,560 as is the case in the ImageNet dataset, 964 00:58:05,560 --> 00:58:08,370 it is still true that in our case 965 00:58:08,370 --> 00:58:12,200 we observed that there is still a large benefit from having 966 00:58:12,200 --> 00:58:15,950 a fine-grained segmentation. 967 00:58:15,950 --> 00:58:20,170 So probably the network is not able to completely discard 968 00:58:20,170 --> 00:58:23,220 the non-relevant information that is in the background. 969 00:58:23,220 --> 00:58:25,550 So this is a possible interesting direction 970 00:58:25,550 --> 00:58:27,680 of research. 971 00:58:27,680 --> 00:58:34,460 And finally, the last point that I decided to tell you about-- 972 00:58:34,460 --> 00:58:36,110 I will skip the incremental part, 973 00:58:36,110 --> 00:58:40,670 because it's ongoing work that I'm doing with Raffaello-- 974 00:58:40,670 --> 00:58:44,090 is about the exploitation of temporal contextual 975 00:58:44,090 --> 00:58:46,850 information. 976 00:58:46,850 --> 00:58:50,930 Here, you can see the same kind of plot 977 00:58:50,930 --> 00:58:53,640 that I showed you before. 978 00:58:53,640 --> 00:58:56,270 So the task is object identification, 979 00:58:56,270 --> 00:58:58,330 with an increasing number of objects. 980 00:58:58,330 --> 00:59:02,660 And the dotted black line represents the accuracy 981 00:59:02,660 --> 00:59:06,520 that you obtain if you consider, as you 982 00:59:06,520 --> 00:59:09,800 were asking before, the classification of each 983 00:59:09,800 --> 00:59:12,320 frame independently. 984 00:59:12,320 --> 00:59:14,870 So you can see that in this case the accuracy that you get 985 00:59:14,870 --> 00:59:18,620 is pretty low, considering that we have to discriminate 986 00:59:18,620 --> 00:59:21,390 between only 28 objects. 987 00:59:21,390 --> 00:59:24,860 However, it is also true that as soon 988 00:59:24,860 --> 00:59:34,100 as you start considering, instead of the prediction obtained looking 989 00:59:34,100 --> 00:59:39,320 only at the current frame, the most frequent prediction 990 00:59:39,320 --> 00:59:43,370 that occurred in a temporal window-- so in the previous, 991 00:59:43,370 --> 00:59:46,190 let's say, 50 frames-- 992 00:59:46,190 --> 00:59:50,150 you can boost your recognition accuracy a lot.
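A minimal sketch of this windowed voting; the 50-frame window follows the talk, while the class-handling details are assumptions for illustration.

```python
from collections import Counter, deque

# Instead of trusting each frame on its own, report the most frequent
# per-frame prediction over a sliding window of recent frames.

class TemporalVoter:
    def __init__(self, window=50):
        self.history = deque(maxlen=window)

    def update(self, frame_prediction):
        """Add the current frame's predicted label and return the majority vote."""
        self.history.append(frame_prediction)
        return Counter(self.history).most_common(1)[0][0]

voter = TemporalVoter(window=50)
# for each incoming frame: stable_label = voter.update(per_frame_label)
```

Even this very simple smoothing exploits the fact that consecutive frames show the same object, which is exactly the kind of contextual information listed among the requirements above.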
993 00:59:50,150 --> 00:59:54,050 As you can see here, from green to red, 994 00:59:54,050 --> 00:59:56,920 increasing the length of the temporal window 995 00:59:56,920 --> 00:59:59,480 increases the recognition accuracy that you get. 996 00:59:59,480 --> 01:00:00,980 This is a very simple approach. 997 01:00:00,980 --> 01:00:05,360 But it shows that the fact 998 01:00:05,360 --> 01:00:08,390 that you are actually dealing with videos instead of images 999 01:00:08,390 --> 01:00:09,590 in the wild is relevant. 1000 01:00:09,590 --> 01:00:13,460 And it is another direction of research. 1001 01:00:13,460 --> 01:00:16,620 So finally, in the last part of my talk, 1002 01:00:16,620 --> 01:00:18,470 I would like to tell you about the work 1003 01:00:18,470 --> 01:00:24,050 that I'm actually doing now, which concerns 1004 01:00:24,050 --> 01:00:28,970 mostly object categorization tasks instead 1005 01:00:28,970 --> 01:00:30,860 of identification. 1006 01:00:30,860 --> 01:00:33,680 And this is the reason why we decided 1007 01:00:33,680 --> 01:00:37,160 to acquire a new dataset, which is 1008 01:00:37,160 --> 01:00:40,190 larger than the previous one, 1009 01:00:40,190 --> 01:00:49,520 because it is composed not only of more categories but, 1010 01:00:49,520 --> 01:00:53,210 in particular, of more instances per category, 1011 01:00:53,210 --> 01:00:57,530 in order to be able to perform categorization experiments, as I 1012 01:00:57,530 --> 01:00:58,670 told you. 1013 01:00:58,670 --> 01:01:02,660 Here, you can see the categories with which we are starting 1014 01:01:02,660 --> 01:01:08,850 are 28, divided into seven macro-categories, let's say. 1015 01:01:08,850 --> 01:01:13,490 But the idea of this dataset is to have a dataset that is continuously 1016 01:01:13,490 --> 01:01:16,310 expandable in time. 1017 01:01:16,310 --> 01:01:18,830 So there is an application that we 1018 01:01:18,830 --> 01:01:20,850 use to acquire this dataset. 1019 01:01:20,850 --> 01:01:23,660 And the idea is to perform periodic acquisitions 1020 01:01:23,660 --> 01:01:29,120 in order to incrementally enrich the knowledge of the robot 1021 01:01:29,120 --> 01:01:31,440 about the objects in the scene. 1022 01:01:34,610 --> 01:01:38,990 Also, another important factor regarding this dataset 1023 01:01:38,990 --> 01:01:42,290 is that, differently from the previous one, 1024 01:01:42,290 --> 01:01:46,640 it will be divided and tagged by nuisance factors. 1025 01:01:46,640 --> 01:01:50,030 And in particular, for each object, 1026 01:01:50,030 --> 01:01:53,150 we are acquiring different videos 1027 01:01:53,150 --> 01:01:57,320 where we isolate the different transformations 1028 01:01:57,320 --> 01:01:59,390 that the object is undergoing. 1029 01:01:59,390 --> 01:02:02,350 So we have a video where the object is just shown 1030 01:02:02,350 --> 01:02:03,720 at different scales. 1031 01:02:03,720 --> 01:02:05,870 Then it is rotating in the plane, then outside 1032 01:02:05,870 --> 01:02:08,709 the plane, then it is translating. 1033 01:02:08,709 --> 01:02:10,250 And then there is a final video where 1034 01:02:10,250 --> 01:02:14,490 all of these transformations occur simultaneously. 1035 01:02:14,490 --> 01:02:16,490 And finally, to acquire this dataset we 1036 01:02:16,490 --> 01:02:24,290 decided to also use depth information, so in the end 1037 01:02:24,290 --> 01:02:26,720 we acquired both the left and the right cameras.
1038 01:02:26,720 --> 01:02:29,390 And in principle, this information 1039 01:02:29,390 --> 01:02:34,160 could be used to obtain the 3D structure of the objects. 1040 01:02:34,160 --> 01:02:40,850 And this is the idea that we used in order 1041 01:02:40,850 --> 01:02:44,570 to make the robot focus on the object of interest 1042 01:02:44,570 --> 01:02:47,080 using disparity. 1043 01:02:47,080 --> 01:02:48,890 Disparity is very useful in this case, 1044 01:02:48,890 --> 01:02:57,410 because it allows the robot to detect unknown objects given just 1045 01:02:57,410 --> 01:03:02,720 the fact that we know that we want the robot focused 1046 01:03:02,720 --> 01:03:05,690 on the closest object in the scene. 1047 01:03:05,690 --> 01:03:09,470 So it is a very powerful method in order 1048 01:03:09,470 --> 01:03:11,240 to have the robot track an unknown 1049 01:03:11,240 --> 01:03:16,720 object under all different lighting conditions and so on. 1050 01:03:16,720 --> 01:03:17,220 Yeah. 1051 01:03:17,220 --> 01:03:22,070 And here, you can see this is the left camera. 1052 01:03:22,070 --> 01:03:24,230 This is the disparity map. 1053 01:03:24,230 --> 01:03:27,680 This is its segmentation, which provides an approximate region 1054 01:03:27,680 --> 01:03:29,540 of interest around the object. 1055 01:03:29,540 --> 01:03:31,010 And this is the final output. 1056 01:03:34,010 --> 01:03:36,940 So I started acquiring the first-- 1057 01:03:36,940 --> 01:03:41,330 well, they should be in red, but it's not very clear, I mean. 1058 01:03:41,330 --> 01:03:43,970 I started acquiring the first categories 1059 01:03:43,970 --> 01:03:48,470 among these 21 listed here, which 1060 01:03:48,470 --> 01:03:53,570 are the squeezer, the sprayer, the cream, the oven 1061 01:03:53,570 --> 01:03:55,150 glove, and the bottle. 1062 01:03:55,150 --> 01:03:59,120 For each row, you see the ten instances that I collected. 1063 01:03:59,120 --> 01:04:01,400 And the idea is to continue acquiring them 1064 01:04:01,400 --> 01:04:03,860 when I go back to Genoa. 1065 01:04:03,860 --> 01:04:09,720 And so here, you can see an example of the five videos. 1066 01:04:09,720 --> 01:04:12,100 Actually, I acquired 10 videos per object, 1067 01:04:12,100 --> 01:04:16,400 five for the training set and five for the test set. 1068 01:04:16,400 --> 01:04:19,710 And you can see that in the different videos 1069 01:04:19,710 --> 01:04:23,030 the object is undergoing different transformations. 1070 01:04:23,030 --> 01:04:28,090 And this is the final one, where these transformations are mixed. 1071 01:04:28,090 --> 01:04:32,210 Oh, the images here are not segmented yet. 1072 01:04:32,210 --> 01:04:33,920 So you can see the whole image. 1073 01:04:33,920 --> 01:04:38,000 But in the end, the information about the segmentation 1074 01:04:38,000 --> 01:04:41,280 and disparity and so on will also be available. 1075 01:04:41,280 --> 01:04:44,390 And this dataset of the 50 objects 1076 01:04:44,390 --> 01:04:47,780 that I acquired, together with the application 1077 01:04:47,780 --> 01:04:50,000 that I'm using to acquire the dataset, 1078 01:04:50,000 --> 01:04:52,370 are available if you are willing to use them 1079 01:04:52,370 --> 01:04:55,730 for the projects of Thrust 5, for example, in order 1080 01:04:55,730 --> 01:04:58,190 to investigate the invariance properties 1081 01:04:58,190 --> 01:05:01,170 of the different representations. 1082 01:05:01,170 --> 01:05:05,220 And so that's it.
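For reference, a hedged OpenCV sketch of the disparity-based attention step described above; this is not the actual iCub stereo-vision module, and the parameters and file names are invented for illustration. The idea is simply that the closest surface has the largest disparity, so thresholding near the maximum isolates the held object.

```python
import cv2
import numpy as np

# Compute a disparity map from the two cameras, keep the nearest region, and
# crop a bounding box around it for the recognition pipeline.

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point values

mask = (disparity > 0.8 * disparity.max()).astype(np.uint8) * 255  # keep the closest surface
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((15, 15), np.uint8))

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    region_of_interest = left[y:y + h, x:x + w]   # approximate crop around the object
```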