The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, welcome. Thank you so much for coming. I'm Jeremy Kepner. I'm a fellow at Lincoln Laboratory. I lead the Supercomputing Center there, which means I have the privilege of working every day with pretty much everyone at MIT. I think I have the best job at MIT because I get to help you all pursue your research dreams. And as a result of that, I get an opportunity to see what a really wide range of folks are doing and observe patterns between what different folks are doing.

So with that, I'll get started. This is meant to be some initial motivational material: why you should be interested in learning about this mathematics, the mathematics of big data, and how it relates to machine learning and other really exciting topics. It is a math course. We will be going over a fair amount of math. But we really work hard to make it very accessible to people.

So we start out with a really elementary mathematical concept here, probably one that hopefully most of you are familiar with. It's the basic concept of a circle, right? And I bring that up because many of us know many ways to state this mathematically, right? It's all the points that are an equal distance from a particular point. There are other ways to describe it. But this is the basic mathematical concept of a circle that many of us have grown up with.

But, of course, the other thing we know is that, right, this is the big idea. Although I can write down an equation for a circle, which is the equation for a perfect, ideal circle, we know that such things don't actually exist in nature. There is no true perfect circle in nature. Even this circle that we've drawn here, it has pixels.
If I zoomed in on it enough, it wouldn't look like a circle at all. It would look like a series of blocks. And so consider that approximation process, where we have a mathematical concept of an ideal circle, but we know that such circles don't really exist in nature. We understand that it is worthwhile to think about these mathematical ideals, manipulate them, and then take the results of the manipulation back into the real world. That's a really productive way to think about things and, really, the basis for a lot of what we do here at MIT.

This concept is essentially the basis of modern, or ancient, Western thought on mathematics. If you remember your history courses, this concept of ideal shapes and ideal circles is the foundation of Platonic mathematics some 2,500 years ago. And at the time that they were developing that concept, this idea that there are ideal shapes out there and that thinking about them and manipulating them was a more effective way to reason about the real world, there was a lot of skepticism. You could imagine 2,500 years ago someone walking around and saying, I believe there are these things called ideal circles and ideal squares and ideal shapes, but they don't actually exist in nature. That would probably not be well received. In fact, it was not well received. Many of those philosophers who were thinking about this were very negatively received. And, in fact, if you want to learn about how negative the response was to this, I encourage you to go and read the Allegory of the Cave, which is essentially the story of these philosophers talking about how they're trying to bring the light of this knowledge to the broader world and how they essentially get killed because of it, because people don't want to see it.

So that struggle they experienced 2,500 years ago exists today. You as people at MIT will try to bring mathematical concepts into environments where people are like, I don't see why that's relevant.
And you will experience negative inputs. But you should rest assured that this is a good bet. It's worked well for thousands of years. You know, it's what I base my career on. People ask me, well, what's the basis of it? Well, I'm just betting on math here. It's been a good tool.

So this is why we're beginning to think this way when we talk about big data and machine learning: really looking at the fundamentals, asking what are the ideals that we need in order to effectively reason about the problems that we're facing today in the virtual world. And the fact that this mathematical concept described the natural world so well and also describes the virtual world is sometimes called the unreasonable effectiveness of mathematics. You can look that up. But people talk about math: why does it do such a good job of describing so many things? And people say, well, they don't really know. But it seems to be a good bit of luck that it happens that way.

So circles, that gets us a certain way. But in most of the fields that we work with, and I would say in almost any introductory course that you take in college, whatever the discipline is, whether it be chemistry or mechanical engineering or electrical engineering or physics or biology, the basic fundamental theoretical idea that they will introduce to you is the concept of a linear model.

So there we have a linear model, right? And why do we like linear models? And again, it can be physics. It can be as simple as F = ma. Or, in chemistry, it can be some kind of chemical rate equation. Or, in mechanical engineering, it can be basic concepts of friction. The reason we like these basic linear models is because we can project, right? If that solid line represents what I believe to be-- you know, if I have evidence to support that that is correct, then I feel pretty good about projecting maybe where I don't have data or into a new domain. So linear models allow us to do this reasoning.
And that's why in the first few weeks of almost any introductory course they begin with these linear models, because they have proven to be so effective. Now, there are many non-linear phenomena that are tremendously important, OK? And as a person who deals with large-scale computation, those are a staple of what people do. But in order to do non-linear calculations, or reason about things non-linearly, it usually requires a much more complicated analysis and much more computation, much more data. And so our ability to extrapolate is very limited, OK? It's very limited.

So here I am talking about the benefits of thinking mathematically, talking about linearity. What does this have to do with big data and machine learning? So we would like to be able to do the same things that we've been able to do in other fields in this new emerging field of big data. And this often deals with data that doesn't look like the traditional measurements we see in science. This can be data that has to do with words or images, pictures of people, other types of things that don't feel like the kinds of data that we traditionally deal with in science and engineering. But we know we want to use linear models. So how are we going to do that? How can we take this concept of linearity, which has been so powerful across so many disciplines, and bring it to this field that just feels completely different from the kinds of data that we have?

So to begin with, I need to refresh for you what it really means to be linear. Before, I showed you a line and, hence, the line: linear. But mathematically, linearity means something much deeper. And so here's an equation that you may have first seen in elementary school. We basically have 2 × (3 + 4) = 2 × 3 + 2 × 4. That is called the distributive property. It basically says multiplication distributes over addition.
And this is the fundamental reason why I would say mathematics works in our world, right? If this wasn't true very early on, in the earliest days of inventing mathematics, it would not have been very useful, right? To say that I have two of three plus four of something, and then I can change it and do it in this other way, that's really what makes mathematics useful. And from a deeper perspective, the distributive property is basically what makes math linear. If this property holds, then we can reason about a system linearly.

Now, you're very familiar with this type of mathematics, but there are other types of mathematics. So if you'll allow me, hopefully you will let me just replace those multiplication symbols and addition symbols with this funny circle-times (⊗) and circle-plus (⊕). And we'll get to why I'm going to do that. Because it turns out that, while you have spent most of your careers with traditional arithmetic multiplication and addition, the kind you would do on your calculator or have done in elementary school, there are other pairs of operations that also obey this distributive property and, therefore, allow us to potentially build linear models of very different types of data using this property.

So, as I mentioned, the classic pair: circle-plus is just equal to regular arithmetic addition, as we show on the first line, and circle-times is equal to regular arithmetic multiplication. So those are the standard ones. And, by far, this is the most common pair that we use across the world today. But there are others. So, for instance, I can replace the plus operation with max and the multiplication operation with addition, OK? And the above distributive equation will still hold, right? That's a little confusing. I often get confused that multiplication is now addition.
But this pair, sometimes referred to as max-plus-- you'll sometimes hear about it as the max-plus algebra-- is actually very important in machine learning and neural networks. This is actually the back end of the rectified linear unit; it is essentially this operation. If you didn't understand what that meant, that's OK. We'll get to that later. It's very important in finance. There are certain finance operations that rely on this type of mathematics.

There are other pairs, also. So here's one. I can replace addition with union and multiplication with intersection, right? Now, that also obeys that linear property. This is essentially the pair of operations that, anytime you make a transaction and work with what's called a relational database, that's the mathematical operation pair sitting inside it. It's why those databases work. It allows us to reason about queries, which are just a series of intersections and unions, and then reorder them. In databases, this is called query planning. And if that property wasn't true, we wouldn't be able to do that. So this is a deep property.

So we can put all different types of pairs in here and reason about them linearly. And this is why many, many of the systems we use today work. And so this class is about really exposing that: the mathematics that allows us to think linearly about data that we haven't really thought of as obeying some kind of linear model. This is essentially the critical point of this class.

So it goes beyond that, though. So hopefully you'll allow me to replace those numbers with letters, right? So that's basic algebra there: A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C). Just for a refresher, in the previous equation we had A = 2, B = 3, C = 4. But we're not limited to these variables, or these letters, being just simple scalar numbers, in this case, real numbers or integers or something like that. They can be other things, too.
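[To make that concrete, here is a minimal sketch, not from the lecture slides, that numerically checks the distributive property for each of the operator pairs just mentioned. The helper name `distributes` is purely illustrative.]

```python
# Check a (x) (b (+) c) == (a (x) b) (+) (a (x) c) for several pairs.

def distributes(otimes, oplus, a, b, c):
    """Return True if otimes distributes over oplus for these values."""
    return otimes(a, oplus(b, c)) == oplus(otimes(a, b), otimes(a, c))

a, b, c = 2, 3, 4

# Standard arithmetic: oplus = +, otimes = *
print(distributes(lambda x, y: x * y, lambda x, y: x + y, a, b, c))  # True

# Max-plus algebra: oplus = max, otimes = +
print(distributes(lambda x, y: x + y, max, a, b, c))                 # True

# Sets: oplus = union, otimes = intersection
A, B, C = {1, 2}, {2, 3}, {3, 4}
print(distributes(lambda x, y: x & y, lambda x, y: x | y, A, B, C))  # True
```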
So, for instance, A, B, and C could be spreadsheets. And that's something we'll go over extensively in the class, so that I can basically have A, B, and C be whole spreadsheets of data and the linear equation will still hold. And, in fact, that's probably the key concept in big data: the necessity to reason about data as whole collections and to transform whole collections. Going and looking at things one element at a time is essentially the thing that is extremely difficult to do when you have large amounts of data.

A, B, and C can be database tables, right? Those don't differ too much from spreadsheets. And as I told you on the previous slide, that union/intersection pair naturally lines up, and we can reason about whole tables in a database using linear properties.

They can be matrices. I think, for those of you who have had linear algebra and matrix mathematics, that would have been the first example, right, when I substituted the A, B, and C and had these linear equations. Often, in many of the sciences, we think about matrix operations and linearity as being coupled together. And through the duality between matrices and graphs and networks, we can represent graphs and networks through matrices. Any time you work with a neural network, you're representing that network as a matrix. And, of course, all these equations apply there as well, and you can reason about those systems linearly.

So that provides a little motivation there. As we like to say, enough about me, let me tell you about my book. So this will be the text that we will use in the class. We are not going to go through the full text, but we have printed out copies of the first seven chapters that we will go through. And we will hand those out later when you do the class.
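[Coming back to the whole-collection idea for a moment: here is a tiny illustrative sketch, assuming nothing beyond the Python standard library, of treating entire key-value tables (a stand-in for spreadsheets or associative arrays) as single values that can be combined in one operation. The `combine` helper and the sample data are invented for the example.]

```python
# Treat whole key-value tables as single algebraic objects.
# Keys act like row labels; missing keys behave like empty entries.

def combine(A, B, op):
    """Apply op over the union of keys, passing through absent entries."""
    out = {}
    for k in A.keys() | B.keys():
        if k in A and k in B:
            out[k] = op(A[k], B[k])
        else:
            out[k] = A.get(k, B.get(k))
    return out

sales_jan = {"apple": 3, "banana": 5}
sales_feb = {"banana": 2, "cherry": 7}

# "Add" two whole spreadsheets at once instead of row by row.
totals = combine(sales_jan, sales_feb, lambda x, y: x + y)
print(totals)  # {'apple': 3, 'banana': 7, 'cherry': 7}
```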
So let me now switch gears a little bit and talk about how this relates to, I think, one of the most wonderful breakthroughs that we have seen, or I've seen in my career, and many of my colleagues here at MIT have seen, which is what's been going on in machine learning. It's not hype. There's a real there there, and it's tremendously exciting.

So let me give you a little history, a basic history of this field. So in a certain sense, before 2010, machine learning looked like this. And then, after 2015, it kind of looks like this. So when people talk about the hype in machine learning, or AI, really, deep neural networks are the elephant inside the machine learning snake. It has stormed onto the scene in the last five years and basically allowed us to do things that we had almost taken for granted were impossible. Just the fact that you're able to talk to computers and they can understand you, that we can have computers that can see, at least in a way that approximates the way humans do, these are really almost technological miracles that, for those of us who have been working on this field for fifty years, we had almost literally given up on. And then all of a sudden it became possible.

So let me give you a little sense of appreciation for this field and its roots. So machine learning, like any field, is defined as a set of techniques and problems. When you ask what defines a field, you ask, well, what are the problems that they work on that other fields don't really work on? And what are the techniques they employ that are not really being employed by other fields? So the core techniques, as I mentioned earlier, are these neural networks. These are meant to crudely approximate maybe the way humans think about problems. We have these circles, which are neurons. They have connections to other neurons. You know, those connections have different weights associated with them. As information comes in, they get multiplied by those weights.
They get summed together. And if they pass certain thresholds or criteria, then they send a signal on to another neuron. And this is, to a certain degree, how we believe the human brain works, and it is a natural starting point for asking, how could we make computers do similar things?

The big problems that people have worked on are these classic problems in machine learning: language, how do we make computers understand human language; vision, how do we make computers see pictures or explain pictures back to us the way we would like; and strategy and games and other types of things like that. So how do we get them to solve problems?

This is not new. These core concepts trace back to the earliest days of the field. In fact, these four figures here, each one is taken from a paper that was presented at the very first machine learning conference in the mid-1950s. So there was a machine learning conference in the mid-1950s. It was in Los Angeles. It had four papers presented. These were the four papers. And I will say that three of them were done by folks at MIT Lincoln Laboratory, which is where I work. And so that was basically the neural networks of language and vision. And we didn't play games, so that was it.

And you might say, well, why is that? Why was there so much work going on at Lincoln Laboratory in the mid-1950s that they would want to pioneer in these directions? At that time, people were first building computers, and computers were very special purpose. So different organizations around the world were building computers to do different things. Some were doing them to simulate complex fluid dynamics systems-- think about designing ships or airplanes or other types of things like that. Others were doing them to, say, like what Alan Turing was doing, break codes. And our task was to help people who were watching radar scopes make decisions, right?
How could computers enable humans to watch more sensors and see where they're going? How could we do that? So at Lincoln Laboratory, we were building special purpose computers to do this. And we built the first large computer with reliable, fast memory. This system had 4,096 bytes of memory, which, at the time, people thought was too much. What could you possibly do with 4,096 numbers? The human brain, of course! Right, that's enough, right? Most of us can remember five, six, seven digits, right? So a computer that can remember 4,096 numbers should be able to do things like language and vision and strategy. So why not? So they went out and they started working on these problems, OK?

But Lincoln Laboratory, being an applied research laboratory, we are required to get answers to our sponsors in a few years' time frame. If problems are going to take longer than that, then they really are the purview of the basic research community, universities. And it became apparent pretty early on that this problem was going to be more difficult. It was not going to be solved right away.

So we did what we often do: we partnered. We found some bright young people at MIT, people just like yourselves. In this case, we found a young professor named Marvin Minsky. And we said, why don't you go and get some of your friends together and create a meeting where you can lay out what the fundamental challenges of this field are? And then we will figure out how to get that funded so that you can go and do that research. And that was the famous Dartmouth AI conference, which kicked off the field. And the person leading this group, Oliver Selfridge at Lincoln Laboratory, basically arranged for that conference to happen and then subsequently arranged for what would become the MIT AI Lab, which was founded by Professor Minsky. And likewise, Professor Selfridge also realized that we would need more computing power.
So he left Lincoln Laboratory and formed what was called Project MAC, which became the Laboratory for Computer Science. And those two entities merged 30 years later to become CSAIL. So that was the initial thing.

Now, it was pretty clear that, when this problem was handed off to the basic research community, there was a feeling that these problems would be solved in about a decade. So we were really thinking the mid-1960s is when these problems would be solved. So it's like giving someone an assignment, right? You all are given assignments by professors, and they give you a week to do it. But it took a little longer. In this case, it took five weeks-- or, rather, five decades-- to solve this problem. But we have. We have now really, using those techniques, made tremendous progress on those problems.

But we don't know why it works. We've made this tremendous progress, but we don't really understand why. So let me show you a little bit of what we have learned, and this course will explore the deeper mathematics to help us gain insight. We still don't know why it works, but at least we can lay the foundations, and maybe you can figure it out. So here I am, fifty years later, a person from Lincoln Laboratory saying, "All right. Question one has been answered. Here's question two: why does this work?" And hopefully you can be the generation that figures it out. Hopefully it'll take less than fifty years. Historically, once we know that something works, it usually takes about twenty years to figure out why. But maybe, you know, some of you are smarter and you'll figure it out faster.

So this is what a neural network looks like. On the left you have your input, in this case a vector, y_0. It's just these dots that are called features. What is a feature?
Anything can be a feature. That is the power of neural networks: they don't require you to state a priori what the inputs can be. They can be anything. People have said, well, you know, neural networks, machine learning, it's just curve fitting. Yeah, but it's curve fitting without domain knowledge. Because domain knowledge is so costly and expensive to create that having a general system that can do this is really what's so powerful.

So the inputs: we have an input feature vector, which we call y_0. And that can just be an image, right, the canonical thing being an image of a cat, right? And that can just be the pixel values rolled out into a vector, and they will be the inputs. And then we have a series of layers. These are called hidden layers. The circles are often referred to as neurons, OK? And each line connecting each dot has a value associated with it, a weight. And the strength of the connection between any two neurons is given by that weight.

And then, ultimately, the output, in this case, the output classification, the series of blue dots there, are the different possible categories. So if I put in a cat picture, one of those dots would be cat, maybe one would be dog, maybe one would be apple or orange, whatever I desired. And the whole idea is that, if I put in a picture of a cat and I set all these values correctly, then the dot corresponding to cat will end up with the highest score, right?

And then, I mentioned earlier that each one of these neurons collects inputs. And if it's above a certain threshold, it then chooses to pass on information to the next. And that's where these b values come in. They are vectors, just the thresholds associated with each layer: one value for each of those neurons.
This entire system can be represented relatively simply with one equation: y_(i+1) = h(W_i y_i + b_i). That is, y_(i+1), the vector at the next layer, can be computed from the previous vector, y_i, matrix multiplied by the weights, W_i. So whenever you see transformations from one set of neurons to the next layer, you should think, oh, I have a matrix that represents all those weights, and I'm going to multiply it by the vector to get the next one. Then we apply these thresholds, all right? So we add these b_i's, and then we have a function, h, that we pass it through.

Typically, this h function has been given the name rectified linear unit. It's much simpler than that name suggests. If the value that comes out of this matrix multiply is greater than zero, don't touch it. Just let it pass through. If it's less than zero, make it zero, right? You know, it's a pretty complicated name for a very simple function. That's actually critical, though. If you didn't have that h function, this nonlinear function there, then we could roll up all of these layers together and we would just have one big matrix equation, right? So that's really considered a pretty important part of it.

So that's pretty much what's going on. When you want to know what the big deal is with neural networks, that's all that's going on. It's just that equation. The challenge is we don't know what the W's and the b's are. And we don't know how many layers there should be. And we don't know how many neurons there should be in each layer. And although the features can be arbitrary, picking the right ones does matter. And picking the right categories does matter. So when people say, I do machine learning, they're basically playing with all of these parameters to try and find the ones that will work best for their problem. And there's a lot of trial and error.
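[As a concrete illustration, here is a minimal numpy sketch of that inference equation, y_(i+1) = h(W_i y_i + b_i). The layer sizes and random weights are made up for the example; in practice the W's and b's would come from training.]

```python
import numpy as np

def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return np.maximum(x, 0)

def forward(y0, weights, biases):
    """Apply y_{i+1} = h(W_i y_i + b_i) layer by layer."""
    y = y0
    for W, b in zip(weights, biases):
        y = relu(W @ y + b)
    return y

rng = np.random.default_rng(0)

# Made-up layer sizes: 4 input features -> 3 hidden neurons -> 2 categories.
sizes = [4, 3, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y0 = rng.standard_normal(4)          # stand-in for a flattened image
print(forward(y0, weights, biases))  # scores for each output category
```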
And you'll hear that there are now systems that try to use machine learning to do that process automatically. You know, how do you make machines that learn how to do machine learning?

The basic approach is a trial and error approach. I take a whole bunch of pictures that I know have cats in them, OK, and other things, right? And I randomly set all those weights and thresholds. I guess what I think the number of layers and neurons and all that should be. And I put in the vector, run it through the system, and get an estimate, or a calculation, of what I think these final values should be, and I compare it with the truth. That is, I just basically subtract it. And then I use those corrections to very carefully adjust the weights. Basically, with the last weights first, I do what's called back propagation of these little changes to try and make a better guess on what these weights should be. So if you hear the term back propagation, that's that process of taking those differences and using them to adjust these weights by about 0.01% at a time. And then we just do this over and over again until eventually we get a set of weights that we think does the problem well enough for our purpose. So that's called back propagation, all right?

Once we have the set of weights and we have a new picture that we want to know what it is, we drop it in there and it tells us it's a cat or a dog or whatever. That forward step is called inference. These are two words you'll hear frequently in machine learning: back propagation and inference. And that's all there is to it. There's really nothing else to it. If you can understand this equation, you'll be way ahead of most people in machine learning, you know? There are lots of people who understand all the software and the packages and the data. All of them are just doing that.
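[Here is a rough numpy sketch, not the lecture's code, of that trial-and-error loop: random weights, a forward pass, subtracting from the truth, and back propagating small adjustments, last weights first. The toy data and layer sizes are invented for the example, and inputs are batched as rows, so each layer computes x @ W rather than W y.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: inputs x and one-hot "truth" labels t (e.g., cat vs. not-cat).
x = rng.standard_normal((50, 4))
t = np.zeros((50, 2))
t[np.arange(50), (x[:, 0] > 0).astype(int)] = 1.0

# Randomly set all the weights and thresholds, as described above.
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 2)) * 0.1, np.zeros(2)
lr = 0.1  # each step nudges the weights a small amount

for step in range(500):
    # Inference (forward step): run the inputs through the network.
    h = np.maximum(x @ W1 + b1, 0)      # hidden layer with ReLU
    y = h @ W2 + b2                     # output scores

    # Compare with the truth: just subtract.
    err = y - t

    # Back propagation: push the differences backward, last weights first.
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (h > 0)         # gradient gated by the ReLU
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)

    # Carefully adjust every weight against its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", (err ** 2).mean())
```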
And I'd say one of the most powerful ways to be ahead in your field is to actually understand the mathematical principles. Because then the software, and what it's doing, is much clearer. And other people who don't understand these mathematical principles, they're really guessing. They're like, oh, well, I do this and I throw this module in. They don't really know that all it's doing is making adjustments to these various equations, how many different layers there are, and stuff like that.

Now, why is this important? You're like, well, what does it matter? As I said before, we have this system. It works, but we don't know why. Well, why is it important to know why? Well, there are two reasons. One is that, if we want to be able to apply this incredible innovation to other domains-- and many of you probably want to do that. Many of you want to say, how can I apply machine learning to something other than language or vision or some of these other standard problems? I kind of need some theory to know. Like, OK, if I have a problem that's like this one over here and I changed it in this way, there's a good chance it'll work. There's some basis for why I'm going to try something, right? Right now there's a lot of trial and error. It's like, well, it's an idea. But if you can have some math that says, you know, I think that will probably work, that really is a great way to guide your reasoning and guide your efforts.

Another reason is that-- so here's a picture of a very cute poodle, right? And the machine learning system correctly identifies it as a poodle. One thing we realized is that the way you and I see that picture is actually very, very different from the way the neural network sees that picture, all right?
And, in fact, I can make changes to that picture that are imperceptible to you or me but will completely change how the neural network sees it. That is, given our neural network, I can basically make it think anything, right? And so, for instance, this is a famous paper. And they got the system to think that that was an ostrich, right? And you can basically show this for anything, right? So what's called robust AI, or robust machine learning, machine learning that can't be tricked, is going to become more and more important. And again, having a deeper understanding of the theory is very, very critical to that.

So how are we going to do this? What's the main concept that we are going to go through in this class? This has mostly been motivational. But how are we going to understand the data at a deeper level? You know, what's the big idea? And the big idea is captured now in this, I apologize for this eye chart slide, which is what we call declarative, mathematically rigorous data. So we have this mathematical concept called an associative array and its corresponding algebra, which basically encompasses the data you would put in databases, the data you would put in graphs, the data you would put in matrices, and it makes it all one linear system.

And the key operations are outlined there at the bottom. If you recall, we have our basic little addition and multiplication. And then what's going to be very important, probably the real workhorse for this-- and I didn't show it before-- is what's essentially array multiplication or matrix multiplication. And that's the far one on the right there, which we often abbreviate with no symbol at all, just AB. But if we really want to explicitly call out that it's matrix multiplication, as a combination of both multiplication and addition, we put in what we call the punch-drunk emoji, which is a plus dot times: ⊕.⊗.
You're probably all young enough that you don't even remember when emojis had to be typed out with just little characters and we didn't have icons, right? So that one meant you went to the bar and lost the fight, right? But, anyway, that's really going to be the workhorse of what we're doing here.
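[To make the ⊕.⊗ idea concrete, here is a small sketch, not from the course materials, of matrix multiplication with the addition and multiplication operations swapped out. The function name `semiring_matmul` is just illustrative.]

```python
# Matrix multiply C = A (+).(x) B where "add" and "mul" are pluggable.
# With add=+, mul=* this is ordinary matrix multiplication; with
# add=max, mul=+ it is the max-plus product mentioned earlier.
from functools import reduce

def semiring_matmul(A, B, add, mul):
    """C[i][j] = add-reduction over k of mul(A[i][k], B[k][j])."""
    n, m, p = len(A), len(B), len(B[0])
    return [[reduce(add, (mul(A[i][k], B[k][j]) for k in range(m)))
             for j in range(p)] for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Ordinary +.* matrix multiply.
print(semiring_matmul(A, B, lambda x, y: x + y, lambda x, y: x * y))
# [[19, 22], [43, 50]]

# Max-plus (max.+) matrix multiply.
print(semiring_matmul(A, B, max, lambda x, y: x + y))
# [[9, 10], [11, 12]]
```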