[SQUEAKING]
[PAPERS RUSTLING]
[CLICKING]

JEREMY KEPNER: All right. Welcome. Great to see everyone here. We're really excited about this opportunity. As you know, our AI accelerator has officially kicked off, and all of your teams are ready to go. We wanted this to be an opportunity for us, as a team, to come together and develop some common foundation -- some common technological foundation, some common language -- for talking about these very challenging AI problems. And so with that, I'll hand it over to Vijay.

VIJAY GADEPALLY: All right.

JEREMY KEPNER: Who will kick off with the first lecture, which basically provides some overview AI context for this.

VIJAY GADEPALLY: Again, welcome to the class. We're really looking forward to this. What we're going to present this morning is really a lot of overview material, right? Many of you here know a lot about AI and machine learning. This is really meant to level set before we start the program, before we start these classes.

So you can see this generic title -- Artificial Intelligence and Machine Learning -- and we're going to try and cover all of that in about an hour. So some details might be skipped, but we'll try and hit the salient features. All of these slides are available for you to use, so if you're presenting back to your own teams, please feel free to pull from them. We've actually spent some time putting a good set of survey and overview slides together. If any of these are useful to you, just email us, or we'll make them available to you. You're more than welcome to use any and all of these slides if you're trying to present this back to other people.

So with that, let's begin. We're going to do a quick overview of artificial intelligence. Again, a lot of level setting going on here. We're going to do a few quick deep dives -- these aren't the deepest of dives, given the amount of time that we have -- and talk very quickly about supervised, unsupervised, and reinforcement learning, and then summarize. We can certainly stop for questions, philosophical debates, et cetera, towards the end. We'll try not to get a lot of the philosophical debates on camera if we can.

All right. So, first question: what is artificial intelligence? This is a question that probably a lot of you get, and I certainly have received this from a number of people.
And it actually took us a lot of time to come up with an answer. We were very fortunate to have Professor Winston spend some time with us out at Lincoln Laboratory, and we brainstormed for a good hour or two, really trying to come up with a good definition for what we call artificial intelligence. What we came up with is that there are two aspects to artificial intelligence that we should not confuse with each other. One is the concept of narrow AI, and the other is the concept of general AI; sometimes in conversation we tend to conflate or mix the two. Narrow AI, according to our definition, is the theory and development of computer systems that perform tasks that augment human intelligence, such as perceiving, classifying, learning, abstracting, reasoning, and/or acting. Certainly in a lot of the programs that we work in, we're very focused on narrow AI and not necessarily the more general AI, which we define as full autonomy. So that's a very high-level definition of what we mean by AI.

Now, many of you in the crowd are probably saying, well, AI has been around for a while. People have been talking about this for 50, 60-plus years. Why now? What is so special about it now? Why is this the conversation piece now? From what we've seen, it really is the convergence of three different communities that have come together. The first is the community around big data. The second is the community around computing and computing technologies. And finally, a lot of research and results in machine learning algorithms. The other one I forgot to put up here is dollar signs: people have basically figured out how to make money off of selling advertisements, labeling cat pictures, et cetera. So that's maybe the hidden "why now" in particular. But these are the three large technical areas that have evolved over the past decade or so to really make AI something we discuss a lot today.

So when we talk about AI, there are a number of different pieces which make up an AI system. And we love the algorithms, people, but there is a lot more going on outside of that. We've spent a significant amount of effort just trying to figure out what goes into an AI system, and this is what we call a canonical architecture. Very much in line with Lincoln Laboratory thinking, we like to think of an end-to-end pipeline: what are the various components, and what are the interconnections between them? Within our AI canonical architecture shown here, we go all the way from sensors to the end user or mission. A lot of the projects that you all are working on are going to go all the way from here to there. A lot of our class, however, for the next few weeks is going to focus on step one, where a lot of people get stuck.

So we take data that comes in through either structured or unstructured sources. It is typically passed into some data conditioning or data curation step, and through that process it is typically converted into some form of information. That information is then passed into a series of algorithms -- maybe one, maybe many. There are lots of them; there is life beyond neural networks. Once we pass the information through the algorithms, it is converted into knowledge. That knowledge is typically then passed into some module that interacts with the end user, or a human, or the mission, and that's what we call the human-machine teaming step. And finally, that knowledge, with the human complement, becomes insight that can then be used to execute the mission that the AI system was created for.

All of these components sit on the bedrock of modern computing. Many different technologies make up modern computing, and the system that we're using today has a combination of some of these computing hardware elements. And certainly within the context of a lot of the projects that we are interested in, all of this also needs to be wrapped in a layer that we call robust AI, which consists of explainable artificial intelligence, metrics and bias assessment, verification, validation, security, policy, ethics, safety, and training. We'll talk very briefly about each of these pieces in a little bit.
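To make that end-to-end flow concrete, here is a minimal sketch of the canonical architecture as composable stages. Every function name and the toy "curation" and "classification" logic are hypothetical placeholders, not anything from the actual lecture materials:

```python
# Toy sketch of the AI canonical architecture:
# sensors -> data conditioning -> algorithms -> human-machine teaming -> mission.
# All stage implementations are invented placeholders.

def data_conditioning(raw_records):
    """Turn raw structured/unstructured records into information."""
    return [r.strip().lower() for r in raw_records if r]  # toy cleanup

def algorithms(information):
    """Turn information into knowledge (here: a trivial 'classifier')."""
    return [("alert" if "error" in item else "normal", item)
            for item in information]

def human_machine_teaming(knowledge):
    """Surface only what a human should review: knowledge -> insight."""
    return [item for label, item in knowledge if label == "alert"]

def pipeline(raw_records):
    return human_machine_teaming(algorithms(data_conditioning(raw_records)))

print(pipeline(["ERROR: sensor 7 offline", "status ok", "", "Error: retry"]))
# -> ['error: sensor 7 offline', 'error: retry']
```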
As I mentioned, AI has an extremely rich history. This is just a very Lincoln- and MIT-specific view of the history of artificial intelligence, but certainly there has been great work since folks like Minsky, Clark, Dineen, and Oliver Selfridge in the '50s. We've seen a lot of work in the '80s and '90s, and recently there has been, again, a resurgence of AI in our parlance and in our thinking about the way AI works. So without going into too much detail about each of these eras and why the winters came about, et cetera, I think John Launchbury at DARPA actually put it very well when he talked about the different waves of AI technology that have come about.

When he talks about it, he talks about three or four waves of AI. The first wave, which you can think of as the first decades of AI technology, resulted in a lot of reasoning-based systems built on handcrafted knowledge. An example of an output of this would be an expert system, right? So, a lot of work in that. If we take the four dimensions that John Launchbury suggests -- the ability of the system to perceive, learn, abstract, and reason -- these systems are typically pretty good at reasoning, because they encoded human knowledge. A human expert sat down, asked what's going on in the system, and tried to write a series of rules. Tax software, for example, does a pretty reasonable job of that: a chartered accountant or a tax expert sits down and encodes a series of rules. We have a question in the back.

AUDIENCE: Yeah. [INAUDIBLE] I just wanted an example of an expert system from the '50s to [INAUDIBLE].

VIJAY GADEPALLY: Yep. So the question is, are there examples of expert systems? Certainly, one would be tax software. My graduate research was actually in autonomous vehicles, and some of the early autonomous vehicles used a form of expert system, where the states of a finite state machine were handcrafted and the transitions between them were designed that way. There was some machine learning wrapped around it, but expert systems certainly played a large part in some of the early autonomous vehicle research that went on.

All right. So over time, we were able to use these expert systems. And don't get me wrong: these systems are still extremely valid in cases where you have limited data availability or limited compute power -- a lot of expert systems are still being used -- or in cases where explainability is a very important factor. You still see expert systems there, because they do have the ability to explain how they came up with an answer. They can typically point to a set of rules that somebody wrote, which is usually quite interpretable by a human.
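As a concrete illustration of that first-wave, handcrafted-knowledge idea, here is a toy rule-based sketch in the spirit of the tax-software example. The rules and thresholds are invented for illustration; they are not real tax logic:

```python
# Toy first-wave "expert system": a human expert encodes the rules by
# hand; the program only applies them. Conditions and thresholds are
# hypothetical.

RULES = [
    (lambda income, dependents: income < 10_000, "no tax owed"),
    (lambda income, dependents: dependents >= 3, "apply dependent credit"),
    (lambda income, dependents: income >= 10_000, "apply standard rate"),
]

def expert_system(income, dependents):
    for condition, action in RULES:
        if condition(income, dependents):
            return action  # the rule that fired doubles as the explanation
    return "no rule fired"

print(expert_system(8_000, 0))   # -> no tax owed
print(expert_system(50_000, 3))  # -> apply dependent credit
```

Note that the explainability the lecture mentions comes for free: whichever rule fired is itself the human-readable justification.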
However, as we were able to collect more data, we were able to understand a little bit more about the underlying process, and we were able to apply statistical learning. This led to the next era or wave of AI technologies, which is often called the learning wave, and it was really enabled by lots of data: data-enabled rather than expert systems. What we mean by that is that we were able to dial back the amount of expert knowledge we encoded into the algorithm -- maybe put a higher level of expert knowledge into it -- but then use data to learn what some of these rules could be.

An example of that, in case someone wants to ask, would be speech processing. We were able to say, well, I believe that speech follows this Gaussian mixture model. So I can encode that level of statistical knowledge, but I'm going to let the system figure out the details of how all that actually works out. (There's a small sketch of this idea just after this passage.) And there are many other cases. Again, coming back to some of the research I did on autonomous vehicles, we were able to use some high-level expert rules -- here is the set of states that a car may be in -- but let the algorithm figure out when transitions occur and what constitutes a transition between different states.

So, looking at the four vectors again, these systems had a little bit more on perception, and obviously we're doing a lot more learning. But their ability to abstract and reason was still pretty low. And by reasoning, we mean: can you explain? Can you tell us what's going on when you give me an output?

The next wave, which we're maybe at the beginning stages of, is what we call contextual learning or contextual adaptation. This is where an AI system can actually add context to what it's doing. I'm not sure I have too many examples of people doing this very well; I think most of the work today probably falls into the end stage of the learning wave of AI, but we're able to combine a bunch of these learning pieces to make it look contextual in nature. The key concept here is having the system automatically abstract and reason -- the way that we think about things. If I see a chair over here and I put a chair somewhere else, I still know it's a chair, because I'm using other context: maybe it's next to a table, or things like that. There's some early research going on in that area.
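Returning to the learning-wave speech example, where the expert supplies the model family (a Gaussian mixture) and the data supplies the parameters, here is a minimal sketch using scikit-learn. The one-dimensional "features" are synthetic stand-ins, not real acoustic data:

```python
# Second-wave idea in miniature: the expert encodes the distribution
# family (a Gaussian mixture); the data determines its parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for acoustic features drawn from two regimes.
features = np.concatenate([
    rng.normal(-2.0, 0.5, size=(200, 1)),
    rng.normal(3.0, 1.0, size=(200, 1)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
print("learned means:", gmm.means_.ravel())   # roughly [-2, 3]
print("learned weights:", gmm.weights_)       # roughly [0.5, 0.5]
```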
And certainly, the wave after that is what we call abstraction. There is very little work on this, but it's something to think about out in the future. This is really the ability of an AI system to actually abstract the information that it's learning. So instead of learning that a chair or a table is something with a leg at the bottom, it learns that a table is something you put things on, and it's able to carry that information or knowledge over to any other domain or field. Do we have any questions before I continue from here? OK, great.

So that's a little bit on the evolution of AI. The reason we like to go through this is that there is great work going on in each of these waves; nothing that people are doing in any of these waves is any lesser or greater. It's typically dependent on what you have at your disposal. What I like to tell people is that the way to think about all of this is that you have a few dials at your disposal. The first dial is, how much compute do you have -- how much ability to crunch data? The second is, how much data do you actually have available? In many cases this means labeled data. And the third dial is, how much knowledge are you able to embed into an algorithm?

In certain cases, where maybe you have very little computing and very little labeled data availability, but a lot of expert knowledge -- a lot of ability to encode information into an algorithm -- you might be able to use an expert system, right? That's a very good use case for it. An example on another dimension, where you want to encode very little human knowledge but you have a lot of computing and data available, would be where neural networks fall in: they're essentially learning what that encoded information should be. A lot of statistical techniques also fall into that camp, where maybe you encode a little bit of information about what the background distribution of the process is, but the system learns the details of exactly how that distribution is modeled, based on the data that it sees.

So you have a lot of different settings that you can use, and there are a number of different techniques within the broader AI context that you can use to achieve your mission. I'm sure many of you are going to be trying different types of algorithms, and a lot of that decision will depend on: how much data was I given? How good is this data that I'm using? Is there an ability to learn anything from this?
And if not, you might have to encode some knowledge of your own into it, saying, well, I know that this process looks like that, so let me tell the algorithm not to waste too much time crunching the data to learn the underlying distribution, which I can tell you -- why don't you learn the parameters of the distribution instead? Does that make sense? All right.

And as you know, there's just a lot going on in AI and machine learning. You can't walk two steps without running into somebody who's either starting something up or working for one of these organizations. So it really is an exciting time to be in the field. All right.

So that's a little bit of the overview, but let's now talk in a little more detail about some of the critical components within this AI architecture. One thing that we like to note -- and there's a reason that, as we've been reaching out to a number of you, we've been talking about getting data, right? Work with stakeholders to get your data in place. The reason we talk about that is that data is critical to breakthroughs in AI. A lot of the press may be on the algorithms that have been designed, but really, when we've looked back in history, we've seen that the availability of a good canonical data set is equally, if not more, critical to a breakthrough in AI.

So what we've done here is pick a select number of breakthroughs in AI. Our definition of a breakthrough in this particular example is something that made a lot of press or something that we thought was really cool, and here are some examples in different years. We've noted the canonical data set that maybe led to, or was cited in, that breakthrough, as well as the algorithms that were used and when they were first proposed. This is notional in nature; clearly, you could adjust these dates a few years here or there. But what we really want to get across is that the average number of years to a breakthrough from the availability of a very important, well-structured, well-labeled data set is much smaller than from when the algorithm was first proposed or first came out.
So as you're developing your challenge problems and your interactions with stakeholders, this is certainly something to keep in mind: there's clearly a lot of algorithmic research that's going to go on, but having a good, strong, well-labeled, and well-documented data set can be equally important. And making that available to the wider AI community, the wider AI ecosystem, can be very, very valuable to your work and the work of many other people. All right.

So, back to the AI architecture. We're going to go very briefly through the different pieces of this architecture. The first piece is data conditioning, which is converting unstructured and structured data. Within the structured data, you might have sources such as sensors, network logs for some of you, metadata associated with sensors, and maybe speech or other such signals. There's also a lot of unstructured data: think of things that you might collect from the internet or download from, say, a social media site, maybe reports, or other types of sensors that don't have strong structure within the data set itself.

Typically, this first step, what we call the data conditioning step, consists of a number of different elements. You might want to first figure out where to put this data. That can often take a lot of time, and there have been religious wars fought on this topic. We're here to tell you that you're probably OK picking most technologies, but if you have any questions, feel free to reach out to me or to others on the team; we have a lot of opinions on the right infrastructure for a given problem. Typically, these infrastructures or databases might provide capabilities such as indexing, organization, and structure -- very important with unstructured data, to convert it into some format that you can do things with. They may allow you to connect to them using domain-specific languages, so you can interact with the data in a language you're used to. And they can provide high-performance data access and, in many cases, a declarative interface, because maybe you don't really care about how the data is being accessed. You want to just say, select the data, give it to me, and then move forward from there.
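As a small sketch of that conditioning step, here is semi-structured data pulled into an indexed, queryable form with pandas. The records, field names, and values are made up for illustration:

```python
# Toy data conditioning: semi-structured records -> indexed table that
# supports declarative-style access ("select the data, give it to me").
import pandas as pd

records = [  # hypothetical sensor metadata, e.g. parsed from JSON logs
    {"sensor": "A1", "time": "2019-01-01T00:00", "reading": 3.2},
    {"sensor": "B7", "time": "2019-01-01T00:05", "reading": None},
    {"sensor": "A1", "time": "2019-01-01T00:10", "reading": 2.9},
]

df = pd.DataFrame.from_records(records)
df["time"] = pd.to_datetime(df["time"])
df = df.set_index(["sensor", "time"]).sort_index()  # indexing/organization

# Declarative-style query: all readings from sensor A1.
print(df.loc["A1"])
```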
Another important part of the data conditioning step is data curation. This, unfortunately, will probably take you a very long time, and it requires a lot of knowledge of the data itself, what you want to do with the data, and how you received the data. In the data curation step, you might perform some unsupervised learning: maybe reduce the dimensionality of your problem, or do some clustering or pattern recognition to remove certain pieces of your data or to highlight certain pieces that look important. You might do some outlier detection. You might highlight missing values. And the list goes on, et cetera, et cetera. A lot goes on in the data curation step; we could certainly spend hours just talking about that.

And the final thing, especially within the context of supervised machine learning, but even in the world of unsupervised learning, would be spending some time on data labeling, right? This is taking data that you've received and typically doing an initial data exploration. That could be as simple as opening it up in Excel to see what the different columns and rows look like, if that's a suitable place to open it. You might highlight missing or incomplete data just from that initial exploration. You might be able to go back to the data provider or to the sensor and say, can you reorient the sensors or recapture the data? I noticed that every time you've measured this particular quantity, it always shows up as 3. I can't imagine that that's correct. Can you go back and tell me if that sensor is actually working? Or is it actually 3? In which case, you might want to know that. And you might look for errors and biases in collection, of course, on top of the actual labeling process that you're doing to highlight phenomenology within the data that you'd like to then look for through your machine learning algorithms.
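A minimal sketch of the kind of curation checks just described, flagging missing values and crude outliers so you know what to take back to the data provider. The column name, values, and 2-sigma threshold are illustrative only:

```python
# Toy data curation: report missing values and flag crude outliers.
# The 2-sigma rule here is purely illustrative, not a recommendation.
import pandas as pd

df = pd.DataFrame({"reading": [3.0, 3.0, 3.0, 3.1, 97.0, None, 2.9]})

missing = df["reading"].isna()
z = (df["reading"] - df["reading"].mean()) / df["reading"].std()
outliers = z.abs() > 2

print("missing rows:", df.index[missing].tolist())   # -> [5]
print("outlier rows:", df.index[outliers].tolist())  # -> [4]
```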
I'll pause for a second. Yes?

AUDIENCE: I have a quick question.

VIJAY GADEPALLY: Yeah?

AUDIENCE: What's the ratio that you see between structured data and unstructured data?

VIJAY GADEPALLY: So the question is, what's the ratio we see between structured and unstructured data? That's a great question. Do you mean the ratio in terms of the volume, or in terms of what you can do with it? Because those are actually almost opposites. Again, speaking of a few data sets that I'm very familiar with, the unstructured data can often be 90% of the volume, and maybe the 10% is the metadata associated with the unstructured data. Most of the value, however, comes from the structured data, where people really analyze the crap out of it, because they know how to.

There is certainly a lot of potential within the unstructured data, and that's why, when we talk to people, we talk a lot about infrastructure and databases as an important first step. If you can just take the unstructured data and put it into a structured or semi-structured form, that itself can provide a lot of value. Very often, in the problems that we see, that 90% of the data volume is largely untapped, because people don't know how to get into it, or don't know what to do with it, or it's not in a form that you can really deal with. So next class, we're going to be talking about how to organize your data -- strategies for organizing data that can get you a lot more value out of the unstructured data. Does that answer your question? Yes?

AUDIENCE: [INAUDIBLE]

VIJAY GADEPALLY: So the question is, when you apply AI or machine learning techniques to a problem domain, is it typically a single modality or multiple modalities? I'd say the answer is both. Certainly, there's a lot of research -- back there, we have Matthew, who's actually doing research right now on how to fuse multiple modalities of data -- and I know a lot of the projects being discussed here are certainly looking at multiple modalities. If I had to characterize things as of today, a lot of the published work may be focused on a single modality. But that's not to say -- I mean, I think there is a lot of value in multiple modalities. The challenge still comes up of how you integrate the data, especially if it's collected from different systems. Yep?

AUDIENCE: Just on structured versus unstructured. It's not really my area, but I am surprised to see speech in the structured [INAUDIBLE]. And I wonder, is that just because the technologies that can process the [INAUDIBLE] and all of this data conditioning are mature enough that you can basically treat it [INAUDIBLE]?

VIJAY GADEPALLY: So the question is, why would speech or something like that fall into structured versus unstructured? And you're absolutely right. I'm sure there are others in the room that might disagree with that and might stick it over here. When we look at the type of acquisition processes and software that are used, they typically come out with some known metadata.
They follow a certain pattern that we can then use, right? There is a clear range, a known frequency at which the data is collected. And that's why we stuck it in the structured data category. Of course, if you're collecting data out in the field without all of that, you could probably stick it into the unstructured world as well. But speech is probably a good example of something that can fall between the two. OK. All right.

Now, for the part everyone's really interested in: machine learning, right? So, all right, you got through the boring data conditioning stuff, which will take you a couple of years or something like that -- nothing serious -- and now you're ready to do the machine learning. And now you're given a choice: which algorithm do you use? Neural networks, you might say, right? There is a lot more, though, beyond the neural network world. There are numerous taxonomies, and I'm going to give you two of them today for describing machine learning algorithms. One really interesting one is from Pedro Domingos at the University of Washington, in which he says that there are five tribes of machine learning.

There are the symbolists; an example of that would be expert systems. There are the Bayesians; an example of an algorithm within that tribe might be naive Bayes. There are the analogizers; an example of that would be a support vector machine. There are the connectionists, an example of which would be deep neural networks. And there are the evolutionaries, an example of which might be genetic programming. What I'm really trying to get across -- and I'm sure the author is trying to get across -- is that there are lots and lots of different algorithms, each with its relative merits and strengths. Apply the right one for your application. Again, use those dials I talked about earlier: the amount of computing you have available, the amount of data you have available, and the amount of expert knowledge that you're able to encode into your algorithm that you think is generalizable enough.
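To ground the taxonomy a little, here is a hedged sketch fitting representatives of two of the tribes -- a Bayesian method (naive Bayes) and an analogizer (a support vector machine) -- on the same synthetic data with scikit-learn; nothing here is from the lecture's own benchmarks:

```python
# Two of Domingos's five tribes on the same toy problem:
# a Bayesian (GaussianNB) and an analogizer (SVC). Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (GaussianNB(), SVC()):
    score = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(model).__name__, "accuracy:", round(score, 3))
```

Which one wins depends on the data, which is exactly the point about applying the right algorithm for your application.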
This next chart is one I've found very useful in describing things to folks who are not familiar with AI and might say, isn't AI just neural networks? Neural networks are a part of AI, but not necessarily all of it. If we think of the big circle as the broad field of artificial intelligence, within that is the world of machine learning. Within machine learning are the connectionists, or neural networks, which fall into a small camp within that. And deep neural networks are a part of neural networks. So can anyone maybe give me an example -- although I've said it numerous times -- of something that might fall outside of machine learning, but inside artificial intelligence, from an algorithmic point of view? Yes?

AUDIENCE: Graph search.

VIJAY GADEPALLY: Graph search could be an example. I would maybe stick that in with some of the connectionists, however.

AUDIENCE: Expert systems.

VIJAY GADEPALLY: Yes, exactly. Expert systems -- that's the one that comes to my mind. Knowledge-based systems are an example of something that falls outside the realm of machine learning, in the very strict sense, but within the realm of artificial intelligence from an algorithmic point of view.

OK, so that's a little bit on the algorithms. Next, let's talk about some of the modern computing engines that are out there. I mentioned that data and compute, as well as algorithms, have been key drivers of the resurgence of AI over the past few years. What are some of these computing technologies? Clearly, CPUs and GPUs: they're very popular computing platforms, with lots of software written to work with them. But what we're seeing now is that with the end of Moore's Law and a lot more performance engineering going on, there's a lot more work and research in hardware architectures that are custom in nature. Custom architectures are almost the new commercial off-the-shelf solutions.

An example of a custom architecture could be Google's Tensor Processing Unit, or TPU. There is some very exciting research going on in the world of neuromorphic computing; I'm happy to chat with you all later if you're interested in what's going on in that area and maybe our role in some of that work. And there is some work that we would still just call custom. This is people looking at an algorithm and saying, OK, here's the data layout, here is the movement of data or information within this algorithm -- let's create a custom processor that does exactly that. An example of that could be the graph processor, which is being developed at Lincoln Laboratory.
And obviously, no slide on computing architectures or computing technologies would be complete without mentioning the word quantum. There are some early results on solving linear systems of equations, but applied to AI, it's still unknown or unproven where quantum may play a part. Certainly, though, it's a technology that all of us, I'm sure, have heard of, continue to track, or are just interested in seeing where it goes. The first few of these, however, are all products that you can buy today. You can go out to your favorite computing store and just purchase these off-the-shelf solutions. A lot of software has been written to work with these different technologies, and it's a really nice time to be involved. Yeah?

AUDIENCE: Can you give a brief concept of what is attached to the [INAUDIBLE]? I see [INAUDIBLE], but I don't really have a high-level concept of why I should associate with that.

VIJAY GADEPALLY: OK, so the question is, what should I think about when I'm thinking about neuromorphic? There are a few features which I'd say fall into the camp of what people are calling neuromorphic computing. One is what they're calling a brain-inspired architecture, which often means it's clockless. A lot of these other technologies have clocked movement of information; these might be clockless in nature. They typically sit on top of different types of memory architectures. And I'm trying to think of another parameter that would be useful -- I can probably send you a couple of things that help highlight that. I certainly wouldn't call myself an expert in this area.

AUDIENCE: OK, thanks.

VIJAY GADEPALLY: But, yeah, I think the term that's used is that it's supposed to mimic the brain in the way that the computing architecture actually performs or functions.

So, lots of research as well. And this is work that we've done here at the lab on actually trying to map the performance of these different processors and how they perform for different types of functions. What we're doing here is basically looking at power on the x-axis, while the y-axis is the peak performance in giga operations per second. Different types of precision are noted by the different shapes of the boxes, and then there are different form factors. The idea here is basically to say: there's so much going on in the world of computing -- how can we compare them?
They all have their own individual areas where they're strong. So one can't just come up and say, well, the GPU is better than the CPU; it depends on what you're trying to do and what the goals of the operation are. Some of the key lines to note here: there seem to be a lot of existing systems on this 100 giga operations per watt line over here, this dashed line. Some of the newer offerings maybe fit on the 1 tera-op per watt line. And some of the research chips, like IBM's TrueNorth or Intel's Arria, fall just a bit under the 10 tera operations per watt line that we see there.

But depending on the type of application, you may be OK with a certain amount of peak power. If you're looking at embedded applications, you're probably somewhere over here, right? If you're trying to get something onto a little drone or something like that, you might want to go here. And if you have a data center, you're probably OK with that type of peak power utilization, but you do need the performance that goes along with it. So I'd say the most important parts to look at are essentially these different lines. Those are the trajectories, from some of the existing systems all the way up to some of the more research-oriented processors out there. OK? All right.
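Those iso-efficiency lines are just peak performance divided by power. A small worked example with hypothetical, made-up numbers shows how two very different processors can sit on the same giga-ops-per-watt line:

```python
# Reading the peak-performance-vs-power chart: efficiency is
# ops/second divided by watts. All numbers below are hypothetical.
processors = {
    "embedded chip":   {"giga_ops": 10,     "watts": 0.1},  # low power
    "data-center GPU": {"giga_ops": 30_000, "watts": 300},  # high power
}

for name, p in processors.items():
    gops_per_watt = p["giga_ops"] / p["watts"]
    print(f"{name}: {gops_per_watt:,.0f} giga-ops/watt")
# Both land on the same 100 giga-ops/watt line despite a ~3000x
# difference in absolute peak performance.
```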
So we talked about modern computing. Let's talk a little bit about the robust AI side of things. The basic idea behind robust AI is that it's extremely important, and the reason it's important is that the consequence of actions in certain applications of AI can be quite high. So what we've done here is think about where humans and machines have their relative strengths. On the x-axis, we have the consequence of action. This could range from, does somebody get hurt if the system doesn't do the right thing? all the way down to, no worries if the system doesn't do the right thing -- which could be, say, some of the labeling of images that we see online. I'm sure people disagree with me on that. But a lot of national security applications, and certainly health applications, fall into the area of high consequence of action: if you give someone the wrong treatment, that's a big deal. And on the y-axis, we have the confidence level in the machine making the decision: how much confidence do we have in the system that's actually making the decision?

In certain cases, we might have very high confidence in the system that's making a decision, and obviously in certain cases we do not. So in areas where you have a low consequence of action and maybe a high confidence level in the machine making the decision, we might say those are best matched to machines -- good candidates for automation. On the contrary, there might be areas where the consequence of action is very high and we have very little confidence in the system making the decision; that's probably an area where we want humans to be intimately, if not solely, involved or responsible. And the area in between is where machines might be augmenting humans.

Does anybody want to venture a couple of examples that we might put into each of these categories? Maybe, what's a good problem you can think of that might be best matched to machines, beyond labeling images for advertisements?

AUDIENCE: Assembly lines.

VIJAY GADEPALLY: Assembly lines? Yep, that's a good example. I'm thinking spam filtering could be another example where -- I mean, there is some machine augmenting human; it does send you an email saying, this is spam, are you sure? But for the most part, it's largely automated.

I'd say a lot of the work that many of us are probably doing falls into the category -- maybe at different points on the spectrum -- of machines augmenting humans. So the system can be providing data back to a human, who can then select. It might filter information out for humans, so that the human can say, OK, instead of looking at a thousand documents, I only have to look at 10, which is much better. And then there are certain things -- anything kinetic, anything that involves life or death -- where we probably want humans to be heavily involved. There are probably legal reasons, also, that we want humans involved with things like that.

One of the examples that we often get is autonomous vehicles, and it's always a little confusing where autonomous vehicles fall in this. Certainly, the consequence of action of a mistake in an autonomous vehicle can be pretty high. And as of today, the confidence in the decision-making is medium at best. But people still seem to somehow be OK with fully automating.
That just shows how terrible Boston roads, or driving in general, are -- we're like, I'm not really sure if this thing will kill me or not, but it's totally worth trying out.

AUDIENCE: Do you think the trend in this chart is to slowly expand the yellow out?

VIJAY GADEPALLY: The question is, is the yellow expanding? I think so. One could make the argument that it's shifting in that direction, that we're finding areas where -- and I think that's maybe the direction. We are probably looking at automating certain things a little bit more as confidence in the decision-making goes up. So you might think about this frontier moving down, with the green expanding slightly and the yellow taking over a little bit of the red. There might be some places where, over time, we're more open to the machine making a decision and the human having a largely supervisory role, which I would put right at the frontier between the yellow and the red.

AUDIENCE: Again, I guess it depends on what augmenting means. But I guess [INAUDIBLE] is truly red without any -- even cognitive -- augmenting.

VIJAY GADEPALLY: I can think of some examples, but maybe I'll share them with you later.

So certainly, robust artificial intelligence plays a very important part in the development and deployment of AI systems. I won't go through the details of each of these; I'm sure many of you are very familiar with them, and I know a few of you are far more knowledgeable about this than I am. But some of the key pieces would be explainable AI, which is a system being able to describe what it's doing in an interpretable fashion. Metrics: being able to provide the right metric if you want to go beyond accuracy or performance. Validation and verification: there might be cases where you're not really concerned about explainability, but you just want to know that when I pass in an input, I get a known output -- and is there a way to confirm that? Another is security; an example of not having security would be counter-AI, right? When we talk about security within the context of robust AI, it's almost the cryptographic way of thinking about it: can I protect the confidentiality, integrity, and availability of my algorithm, the data sets, the outputs, the weights, the biases, et cetera? And finally, of significant importance are policy, ethics, safety, and training.
This is actually very important in some of those applications from the previous slide -- the yellow and the red regions, where machines are augmenting humans. A lot of that might be governed by policy, ethics, safety, and training; in some of the examples I can think of, there are policy reasons that dictate that only a human can be involved, maybe with minimal input from a system. OK.

And the final component of our AI architecture -- we've gone through conditioning, algorithms, computing, and robust AI -- is human-machine teaming. I think what we want to get across with human-machine teaming is that it really depends on the application and what you're trying to do, but it is important to think about the human and the machine working together. There is a spectrum, from where the machine plays a large part and the human is largely supervisory, to where the human plays a large part and the use of the machine, or the AI of the system, is very targeted.

A couple of ways to think about it: of course, we talked about confidence level versus consequence of action, but there's also scale versus application complexity. On the top chart over there, we have on the x-axis the application complexity -- how complex is this application? -- and on the y-axis the scale -- how many times do you need to keep doing this thing? Places where machines might be more effective than humans are where we have low application complexity but very, very high scale. Again, spam filtering falls into this. The complexity of spam filtering has gone up over time, but it's something that is manageable for these systems, and the scale is so high that we just don't want a human involved in that process.

On the other end of the spectrum is where you have very high application complexity that will only happen a couple of times. This could be, say, reviewing a situation where a company is trying to make an acquisition; it's not going to happen over and over. So you might have a human involved who works through a lot of that. Maybe they target the system to go look for specific pieces of information, but really, it's the human that might be more effective there, especially given that the situation changes every time. All right.

So with that, we're going to take a quick tour of the world of machine learning.
I'll 524 00:42:45,400 --> 00:42:54,070 stop there for a second. Any questions? OK. All right. So what is machine learning? Always 525 00:42:54,070 --> 00:42:59,450 a good place to start. It's the study of algorithms that improve their performance at some task 526 00:42:59,450 --> 00:43:04,059 with experience. In this context, experience is data. 527 00:43:04,059 --> 00:43:09,829 And they typically do this by optimizing based on some performance criterion that uses example 528 00:43:09,829 --> 00:43:15,580 data or past experience. So in the world of supervised learning, that example 529 00:43:15,580 --> 00:43:21,390 data or past experience could be the correct label, given an input data set or input data 530 00:43:21,390 --> 00:43:23,630 point. 531 00:43:23,630 --> 00:43:28,240 Machine learning is a combination of techniques from the statistics and computer science 532 00:43:28,240 --> 00:43:34,150 communities. And it's the idea of getting computers to program themselves. Common tasks 533 00:43:34,150 --> 00:43:37,950 within the world of machine learning could be things like classification, regression, 534 00:43:37,950 --> 00:43:41,210 prediction, clustering, et cetera. 535 00:43:41,210 --> 00:43:46,109 For those who are maybe making the shift to machine learning from traditional programming, 536 00:43:46,109 --> 00:43:52,260 I found this, again, from Pedro Domingos to be a very useful way of describing it to people. 537 00:43:52,260 --> 00:43:56,309 So in traditional programming, you have a data set. You write a program, which would 538 00:43:56,309 --> 00:44:01,650 be if you see this, do that. When you see this, do that. For this many instances, do 539 00:44:01,650 --> 00:44:04,480 the following thing on it and then write an output out, right? 540 00:44:04,480 --> 00:44:09,530 So you input the data and the program into a computer, and the computer produces an output 541 00:44:09,530 --> 00:44:14,849 where it says, OK, I've applied this program on that data. And this gives me the output. 542 00:44:14,849 --> 00:44:19,289 Machine learning is a very different way of thinking about it, in which you're almost inputting 543 00:44:19,289 --> 00:44:21,289 the data as well as the output. 544 00:44:21,289 --> 00:44:25,810 So in this case, the data could be unlabeled images. The output could be the labels associated 545 00:44:25,810 --> 00:44:30,839 with those images. And you tell the computer, figure out what the program would look like. 546 00:44:30,839 --> 00:44:34,599 And this is a slightly different way of thinking about machine learning versus traditional 547 00:44:34,599 --> 00:44:36,019 programming. 548 00:44:36,019 --> 00:44:41,759 What are some of these programs or algorithms that the computer might use to figure it out? 549 00:44:41,759 --> 00:44:47,549 So within the large realm of machine learning, we have supervised, unsupervised, and reinforcement 550 00:44:47,549 --> 00:44:53,030 learning. What we have in the brackets is essentially what you're providing. 551 00:44:53,030 --> 00:44:56,589 In the case of supervised learning, you're providing labels, which is the correct label 552 00:44:56,589 --> 00:45:02,410 associated with an input feature or with an input data set or data point. In unsupervised 553 00:45:02,410 --> 00:45:06,080 learning, you typically have no labels, but also are limited by what the algorithm itself 554 00:45:06,080 --> 00:45:07,080 can do.
555 00:45:07,080 --> 00:45:12,109 And in the world of reinforcement learning, instead of a label per data point, you're 556 00:45:12,109 --> 00:45:15,859 providing reward information to the system that says, if you're 557 00:45:15,859 --> 00:45:18,760 doing the right thing, I'm going to give you some points. If you're doing the wrong thing, 558 00:45:18,760 --> 00:45:23,480 I'm going to take away some points-- very useful in very complex applications where 559 00:45:23,480 --> 00:45:28,210 you can't really figure out the labels associated with each data point. 560 00:45:28,210 --> 00:45:32,579 Within the world of supervised learning, the typical tasks that people have-- and I should 561 00:45:32,579 --> 00:45:36,460 note, before I go through this, there's a lot of overlap between all of these different 562 00:45:36,460 --> 00:45:42,069 pieces. So this is a high-level view. But we can certainly argue about the specific 563 00:45:42,069 --> 00:45:45,849 positioning of everything. I'm sure we can. 564 00:45:45,849 --> 00:45:50,099 So within supervised learning, you can fall into classification and regression. Unsupervised 565 00:45:50,099 --> 00:45:55,420 learning is typically clustering and dimensionality reduction. And within these, there are different 566 00:45:55,420 --> 00:46:00,760 algorithms that fall into place. So examples could be things like neural nets, which cover 567 00:46:00,760 --> 00:46:02,349 all of these spaces. 568 00:46:02,349 --> 00:46:10,589 You've got techniques like regression, or PCA, which might fall into dimensionality reduction-- 569 00:46:10,589 --> 00:46:14,960 lots and lots of different techniques, and also some in the reinforcement learning world. 570 00:46:14,960 --> 00:46:20,609 And there's just more and more and more. If you open up a survey of machine learning, 571 00:46:20,609 --> 00:46:25,140 it'll give you even more than all of these techniques over here. 572 00:46:25,140 --> 00:46:28,990 And the thing to remember when you're using machine learning is that there are some common 573 00:46:28,990 --> 00:46:34,309 pitfalls that you can fall into. An example of that would be overfitting versus underfitting, 574 00:46:34,309 --> 00:46:38,690 where you come up with this awesome model that does really, really well on your training 575 00:46:38,690 --> 00:46:39,690 data. 576 00:46:39,690 --> 00:46:46,900 You apply it to your test data, and you get terrible results. You might have done a really 577 00:46:46,900 --> 00:46:51,309 good job learning the training data, but not necessarily been able to generalize 578 00:46:51,309 --> 00:46:59,130 beyond that. Sometimes it could be just that the algorithm itself is unable to correctly model 579 00:46:59,130 --> 00:47:01,950 the behavior that's exhibited by the training and test data. 580 00:47:01,950 --> 00:47:06,351 I won't go through each of these again, but there might just be bad, noisy, missing data. 581 00:47:06,351 --> 00:47:09,690 That certainly happens, where you end up with an algorithm with terrible results. And you 582 00:47:09,690 --> 00:47:12,700 look at it and you're like, well, why is that? And you actually look at the data that you 583 00:47:12,700 --> 00:47:17,599 used. And it was incorrect, or there were just missing features. Or it was noisy in 584 00:47:17,599 --> 00:47:23,059 nature, such that the actual phenomenology that you were trying to look for was hidden 585 00:47:23,059 --> 00:47:24,059 within the noise. 586 00:47:24,059 --> 00:47:28,730 You might have picked the wrong model.
You might have used a linear model in a non-linear 587 00:47:28,730 --> 00:47:34,839 case, where the phenomenology you're trying to describe is non-linear in nature, but maybe 588 00:47:34,839 --> 00:47:40,249 you've used a linear model. You've not done a good job of separating training versus testing 589 00:47:40,249 --> 00:47:44,510 data, et cetera, et cetera. 590 00:47:44,510 --> 00:47:49,220 So we'll just take a quick view into each of these different learning paradigms. So 591 00:47:49,220 --> 00:47:55,089 the first is on supervised learning. And you basically start with labeled data, or what is 592 00:47:55,089 --> 00:47:59,460 often referred to as ground truth. And you build a model that predicts 593 00:47:59,460 --> 00:48:02,349 labels, given new pieces of data. 594 00:48:02,349 --> 00:48:07,400 And you have two general goals. One is regression, which is to predict some continuous 595 00:48:07,400 --> 00:48:12,839 variable; the other is classification, which is to predict a class or label. So if we look at 596 00:48:12,839 --> 00:48:21,130 this, the diagram on the right, we have training data that we provide, which is data and labels. 597 00:48:21,130 --> 00:48:27,349 That goes into a trained model. That's typically an iterative process where we find out, well, 598 00:48:27,349 --> 00:48:32,650 did we do a good job? That is now called a supervised learning model, to which we then apply 599 00:48:32,650 --> 00:48:37,539 new data, or test data, or unseen data, and look at the predicted labels. 600 00:48:37,539 --> 00:48:40,960 Typically, when you are designing an algorithm like this, you'd separate out. You'd take 601 00:48:40,960 --> 00:48:45,610 your training data. You'd remove a small portion of it that you do know the labels for. That's 602 00:48:45,610 --> 00:48:50,060 your test data over here. And then you run that. And you can see, well, is it working 603 00:48:50,060 --> 00:48:52,690 well or not? 604 00:48:52,690 --> 00:48:57,400 And most of these algorithms have a training step that forms a model. So when we talk about 605 00:48:57,400 --> 00:49:02,499 machine learning, in both the supervised and unsupervised sense, we'll often talk about 606 00:49:02,499 --> 00:49:08,279 training the model, which is this process, and then inference, which is the second step, 607 00:49:08,279 --> 00:49:13,660 which is where you apply unseen data. So this is the trained model in deployment or in the 608 00:49:13,660 --> 00:49:16,579 field. It's performing inference at that point. 609 00:49:16,579 --> 00:49:23,650 Of course, no class these days on machine learning and AI could go without talking about 610 00:49:23,650 --> 00:49:30,130 neural networks. And as I mentioned, neural networks do form a very important part of 611 00:49:30,130 --> 00:49:33,680 machine learning. And they certainly are an algorithm that many of you, I'm sure, are 612 00:49:33,680 --> 00:49:38,810 familiar with. And they fall well within the supervised and unsupervised worlds. And they've been 613 00:49:38,810 --> 00:49:42,009 used for so many different applications at this point. 614 00:49:42,009 --> 00:49:46,990 So what's a neural network? A computing system inspired by biological networks. And the system 615 00:49:46,990 --> 00:49:52,710 essentially learns by repetitive training to do tasks based on examples.
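To make the train/test/inference pipeline just described concrete, here is a minimal sketch in Python using scikit-learn. The data set, classifier, and 80/20 split are illustrative assumptions, not choices from the lecture:

```python
# Minimal sketch of the supervised pipeline: split labeled data into
# train/test, fit a model, then run inference on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # data and ground-truth labels

# Remove a small labeled portion as test data so we can check whether
# the trained model generalizes beyond the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)  # the "trained model" box
model.fit(X_train, y_train)                # training step

y_pred = model.predict(X_test)             # inference on unseen data
print("held-out accuracy:", accuracy_score(y_test, y_pred))
```

Scoring on the held-out portion is also how the overfitting pitfall mentioned earlier shows up in practice: a model that has merely memorized the training data will do well in training and poorly here.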
Much of the 616 00:49:52,710 --> 00:49:56,410 work that we've seen typically applies them to supervised learning, though I'll 617 00:49:56,410 --> 00:50:00,509 mention some research that we are doing that actually applies them to unsupervised 618 00:50:00,509 --> 00:50:04,099 learning as well. And they're quite powerful. 619 00:50:04,099 --> 00:50:09,339 The components of a neural network include inputs, layers, outputs, and weights. So these 620 00:50:09,339 --> 00:50:14,039 are often the terms that someone will use. And a deep neural network has lots of hidden 621 00:50:14,039 --> 00:50:20,799 layers. Does anyone here have a better definition for what deep neural network means beyond 622 00:50:20,799 --> 00:50:22,645 lots? I've heard definitions anywhere from 3 layers and above. Yes? 623 00:50:22,645 --> 00:50:24,560 AUDIENCE: [INAUDIBLE] deep neural network will occur at any recurrent networks. Because 624 00:50:24,560 --> 00:50:36,299 that has more than one layer, but not necessarily more than one layer after you have actually 625 00:50:36,299 --> 00:50:37,742 written the code for it. 626 00:50:37,742 --> 00:50:41,580 VIJAY GADEPALLY: OK, so one definition here for deep is-- and this is-- anyone have a 627 00:50:41,580 --> 00:50:48,539 better-- no. So the one to beat right now is-- a feature of a deep neural network could 628 00:50:48,539 --> 00:50:54,980 be recurrence within the network architecture, which implies that there is some depth to 629 00:50:54,980 --> 00:51:02,710 the overall network. So above 3 with recurrence-- deep. All right. 630 00:51:02,710 --> 00:51:08,040 Lots of variants within the supervised world of neural networks, such as convolutional 631 00:51:08,040 --> 00:51:12,541 neural networks, recursive neural networks, deep belief networks. One, I think, in my 632 00:51:12,541 --> 00:51:18,630 opinion-- again, since you've all asked me to opine here. I know you've not, but I think 633 00:51:18,630 --> 00:51:22,390 a reason that these are so popular these days is there are so many tools out there that are 634 00:51:22,390 --> 00:51:24,609 very easy to use. 635 00:51:24,609 --> 00:51:28,369 You can just go online and, within about five minutes, write your first neural network. 636 00:51:28,369 --> 00:51:36,130 Try writing a hidden Markov model that quickly. Maybe there are people who can, but in general. 637 00:51:36,130 --> 00:51:41,650 So what are the features of a deep neural network? So you have some input features. 638 00:51:41,650 --> 00:51:47,359 You have weights, which are essentially associated with each line over here, as well as biases 639 00:51:47,359 --> 00:51:51,789 for each of the layers that govern the interaction between the layers, and then an output layer. 640 00:51:51,789 --> 00:51:57,039 So these input features can often be combined with each other. So these feature vectors that 641 00:51:57,039 --> 00:52:01,559 are coming in can often be combined. I think Jeremy will talk a little bit about the 642 00:52:01,559 --> 00:52:04,520 matrix view of all of this. 643 00:52:04,520 --> 00:52:09,839 But you can think of it as-- an example could be, if you have an image, 644 00:52:09,839 --> 00:52:17,799 the RGB values of each pixel in that image could be the input features. So you 645 00:52:17,799 --> 00:52:22,510 could have large numbers of input features. If you have a time series signal, it could 646 00:52:22,510 --> 00:52:28,410 be the amplitude or the magnitude at a particular frequency or at a particular step.
647 00:52:28,410 --> 00:52:33,140 There's often a combination of features that you might use. So in addition to the pixel 648 00:52:33,140 --> 00:52:39,000 intensities for an image, you might also then combine in the spatial distance between two pixels. 649 00:52:39,000 --> 00:52:43,369 Or a pixel's position within the image may also be another input feature. And you can really 650 00:52:43,369 --> 00:52:47,280 go hog wild over here, just trying to come up with new features. 651 00:52:47,280 --> 00:52:52,030 And there's a lot of research just in that area, which is, I take a data set that everyone 652 00:52:52,030 --> 00:52:56,740 knows. And I'm just going to spend a lot of time doing feature engineering, which is coming 653 00:52:56,740 --> 00:53:01,730 up with, well, what is the right way to do the features? So coming back to an earlier 654 00:53:01,730 --> 00:53:07,400 question, this is an area where people are often looking at supplementing maybe a given 655 00:53:07,400 --> 00:53:09,799 data set with additional data. 656 00:53:09,799 --> 00:53:17,249 And then fusing those two pieces together-- for example, audio and text together 657 00:53:17,249 --> 00:53:24,579 as input features to a network that you can then train-- might do a better job. But 658 00:53:24,579 --> 00:53:30,600 all of this is governed by this really, really simple, but powerful equation, which is that 659 00:53:30,600 --> 00:53:39,749 the output at the (i+1)-th layer is given by some non-linear function of the weights 660 00:53:39,749 --> 00:53:45,650 multiplied by the inputs from the previous layer, plus some bias term. 661 00:53:45,650 --> 00:53:49,109 And when you're learning-- when you're training a machine learning model, you're essentially 662 00:53:49,109 --> 00:53:54,000 trying to figure out what the Ws are and what the Bs are. That's really what a model is 663 00:53:54,000 --> 00:54:00,640 defined as. So if we zoom into one of these pieces, it's actually pretty straightforward 664 00:54:00,640 --> 00:54:02,240 what's going on over here. 665 00:54:02,240 --> 00:54:05,749 So you have your inputs that are coming from the previous layers, so this could be your 666 00:54:05,749 --> 00:54:11,550 Y sub i. Here are the different weights, so W1, W2, W3. These are the connections or the 667 00:54:11,550 --> 00:54:18,819 weights going into a neuron or a node. And you're performing some function on these inputs. 668 00:54:18,819 --> 00:54:22,700 And that function is referred to as an activation function. 669 00:54:22,700 --> 00:54:26,130 So let's just take an example where we have some actual numbers. Maybe I've gone through. 670 00:54:26,130 --> 00:54:31,230 I've trained my models. I figured out that, just for this one dot in that big network 671 00:54:31,230 --> 00:54:40,079 that we saw earlier, my weights are 2.7, 8.6, and 0.002. My inputs from the previous 672 00:54:40,079 --> 00:54:46,779 layer are maybe -0.06, 2.5, 1.4. 673 00:54:46,779 --> 00:54:56,069 And all I'm doing is coming up with this x, which is -0.06 multiplied by 2.7, plus 2.5 674 00:54:56,069 --> 00:55:03,780 times 8.6, plus 1.4 times 0.002. That gives me some number-- 21.34. I apply my non-linear 675 00:55:03,780 --> 00:55:08,829 function, which in this case is a sigmoid, governed by that equation at the top right. 676 00:55:08,829 --> 00:55:15,670 And I say f of 21.34, so somewhere way over there, is approximately 1, right? So this-- 677 00:55:15,670 --> 00:55:21,789 probably a little less than 1, but approximately 1 for the purpose of this.
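In other words, each layer computes y_(i+1) = f(W_i * y_i + b_i), and a single neuron is just a dot product followed by the activation f. Here is that worked example in a few lines of Python; the zero bias is an assumption, since the example on the slide doesn't state one:

```python
import numpy as np

def sigmoid(x):
    # the non-linear activation function f (the equation at the top right)
    return 1.0 / (1.0 + np.exp(-x))

w = np.array([2.7, 8.6, 0.002])       # trained weights into this one neuron
y_prev = np.array([-0.06, 2.5, 1.4])  # inputs from the previous layer
b = 0.0                               # bias (assumed zero; not given on the slide)

x = np.dot(w, y_prev) + b  # -0.06*2.7 + 2.5*8.6 + 1.4*0.002 = 21.3408
print(round(x, 2))         # 21.34
print(sigmoid(x))          # 0.9999999994... -- approximately 1
```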
678 00:55:21,789 --> 00:55:26,220 And you just do that over and over. So really, a neural network-- I think the power of a 679 00:55:26,220 --> 00:55:30,609 neural network is it allows you to encode a lot less information than many of the other 680 00:55:30,609 --> 00:55:35,109 machine learning algorithms out there, at the cost, typically, of a lot more data being 681 00:55:35,109 --> 00:55:40,519 used and a lot more computing being used. But for many people, that's perfectly fine, 682 00:55:40,519 --> 00:55:41,519 right? 683 00:55:41,519 --> 00:55:45,249 But it does take-- it's just over and over, back and forth, back and forth, back and forth 684 00:55:45,249 --> 00:55:54,670 to come up with, what are the right Ws in order for this to give me a result that looks reasonable? 685 00:55:54,670 --> 00:55:58,989 Lots of work going on in just deciding the right activation function. 686 00:55:58,989 --> 00:56:06,200 I showed you a sigmoid over there. We do a lot of work with ReLU units. The choices-- 687 00:56:06,200 --> 00:56:10,369 there are certain applications-- certain, I should say, domains or applications where 688 00:56:10,369 --> 00:56:16,560 people have found that a particular activation function tends to work well. 689 00:56:16,560 --> 00:56:21,440 But that choice is something I leave to domain experts, to maybe look at their problem and 690 00:56:21,440 --> 00:56:25,989 figure out what are the relative advantages. Each of these has its own advantages. I 691 00:56:25,989 --> 00:56:30,230 know, for example, one of the big advantages of the rectified linear unit is that, since you're 692 00:56:30,230 --> 00:56:35,900 not limiting yourself to between a 0 and 1 range, you don't run into 693 00:56:35,900 --> 00:56:39,839 the problem of vanishing gradients. If that doesn't mean much to you, that's OK. We're not 694 00:56:39,839 --> 00:56:43,980 going to spend too much time talking about that anyhow. 695 00:56:43,980 --> 00:56:49,089 AUDIENCE: Vijay? 696 00:56:49,089 --> 00:56:56,760 VIJAY GADEPALLY: Yeah? 697 00:56:56,760 --> 00:56:59,383 AUDIENCE: So in general, [INAUDIBLE]. 698 00:56:59,383 --> 00:57:03,799 VIJAY GADEPALLY: So the question is picking the activation functions, picking the number 699 00:57:03,799 --> 00:57:09,750 of layers. We'll talk about that in a couple of slides. But there is a lot of art. Trial 700 00:57:09,750 --> 00:57:14,580 and error-- yes, but also, we'll call it art as well, that's involved with coming up with 701 00:57:14,580 --> 00:57:15,580 that. 702 00:57:15,580 --> 00:57:20,539 A lot of what happens in practice, however, is you find an application area which looks 703 00:57:20,539 --> 00:57:24,680 very similar to the problem that you are trying to solve. And you might borrow the architecture 704 00:57:24,680 --> 00:57:29,210 from there and use that as a starting point for coming up with where you start. Yeah? 705 00:57:29,210 --> 00:57:38,140 AUDIENCE: Are you aware of any research of some type of parameterizing the activation 706 00:57:38,140 --> 00:57:43,200 function and then trying to learn the activation function? 707 00:57:43,200 --> 00:57:48,079 VIJAY GADEPALLY: I'm sure people are doing it. I'm personally not familiar with that 708 00:57:48,079 --> 00:57:50,440 research. I don't know if anyone else in the room has-- yep? 709 00:57:50,440 --> 00:57:51,839 AUDIENCE: [INAUDIBLE] DARPA D3M program, so data-driven machine learning.
You're trying 710 00:57:51,839 --> 00:57:55,250 to learn both the architecture of the network and the activation function, and therefore 711 00:57:55,250 --> 00:58:07,900 all the other attributes. Because you're trying to just go from data set to machine learning 712 00:58:07,900 --> 00:58:10,019 system with no human intervention. 713 00:58:10,019 --> 00:58:16,609 VIJAY GADEPALLY: So the question was, is there any research into parameterizing the activation 714 00:58:16,609 --> 00:58:23,599 function? So I guess the model as a whole. So, yeah, there is. And one of the responses 715 00:58:23,599 --> 00:58:28,770 was that there is a program run by DARPA, which is the D3M program, which is really 716 00:58:28,770 --> 00:58:37,440 looking at, can you go from data to result with no or almost no human intervention? 717 00:58:37,440 --> 00:58:42,220 I'm not familiar with activation function parameterization. But certainly, network model 718 00:58:42,220 --> 00:58:47,730 parameterization is absolutely there. So people are running optimization models to basically 719 00:58:47,730 --> 00:58:52,660 look for-- I have this particular set of resources. What is the best model architecture that fits 720 00:58:52,660 --> 00:58:53,910 into that? 721 00:58:53,910 --> 00:58:59,190 Maybe I want to deploy this on a really tiny processor that only gives me 16 megabytes 722 00:58:59,190 --> 00:59:03,859 of memory. I want to make sure that my model and data can fit on that. Can you find what 723 00:59:03,859 --> 00:59:08,099 would be the ideal model for that? So that's absolutely something that people are doing 724 00:59:08,099 --> 00:59:14,050 right now. But I'm not sure if people are trying to come up with, I guess, brand new 725 00:59:14,050 --> 00:59:19,180 activation functions. All right. 726 00:59:19,180 --> 00:59:25,250 So lots of stuff in the neural network landscape. And as I mentioned earlier, neural network 727 00:59:25,250 --> 00:59:29,029 training is essentially adjusting weights until the function represented by the neural 728 00:59:29,029 --> 00:59:34,259 network does what you would like it to do. And the key idea here is to iteratively 729 00:59:34,259 --> 00:59:36,900 adjust weights to reduce the error. 730 00:59:36,900 --> 00:59:42,040 So what you do is you take some random instantiation of your neural network, or maybe, based on another 731 00:59:42,040 --> 00:59:47,809 domain or another problem, you might borrow that. And you start there. And then you pass 732 00:59:47,809 --> 00:59:52,539 a data set in. You look at the output and you say, that's not right. What went wrong 733 00:59:52,539 --> 00:59:53,539 over here? 734 00:59:53,539 --> 00:59:58,480 And you go back and adjust things and do that again, and again, and again, and again, and 735 00:59:58,480 --> 01:00:02,680 again, over and over, until you get something that looks reasonable. That's really what's 736 01:00:02,680 --> 01:00:07,769 going on over there. And so real neural networks can have thousands of input data points, 737 01:00:07,769 --> 01:00:10,799 hundreds of layers, and millions to billions of weight changes per iteration. Yes? 738 01:00:10,799 --> 01:00:11,799 AUDIENCE: So what you're talking about is [INAUDIBLE] adjustment [INAUDIBLE]. Do you 739 01:00:11,799 --> 01:00:14,299 know of any [INAUDIBLE] this process?
740 01:00:14,299 --> 01:00:24,999 VIJAY GADEPALLY: Yes, there's a lot of work being done to parallelize this-- 741 01:00:24,999 --> 01:00:25,999 AUDIENCE: Like, for-- 742 01:00:25,999 --> 01:00:26,999 VIJAY GADEPALLY: --and by default. 743 01:00:26,999 --> 01:00:27,999 AUDIENCE: [INAUDIBLE]? 744 01:00:27,999 --> 01:00:32,740 VIJAY GADEPALLY: So the question is-- as I just described it right now, it's a serial 745 01:00:32,740 --> 01:00:38,400 process where I pass one data point in. It goes all the way to the end. It says, oh, 746 01:00:38,400 --> 01:00:42,980 this is the output-- goes back and adjusts. Are there techniques that people are using 747 01:00:42,980 --> 01:00:48,420 to do this in a distributed fashion? And the answer to that is a strong yes. It's a very 748 01:00:48,420 --> 01:00:53,700 active area, especially in high-performance computing and machine learning. 749 01:00:53,700 --> 01:00:59,809 We might talk about this in-- are we talking about this on day three? We might talk a little 750 01:00:59,809 --> 01:01:04,609 bit about it. But there is model parallelism, which is where I have the model itself distributed 751 01:01:04,609 --> 01:01:09,579 across multiple pieces. And I want to adjust different pieces of the model at the same 752 01:01:09,579 --> 01:01:15,039 time. There's research and lots of results. I think we might even have some examples that 753 01:01:15,039 --> 01:01:16,880 people are doing with that. 754 01:01:16,880 --> 01:01:25,819 AUDIENCE: Have you got some examples on the [INAUDIBLE] approach, the [INAUDIBLE] approach? 755 01:01:25,819 --> 01:01:27,470 VIJAY GADEPALLY: A little bit earlier. 756 01:01:27,470 --> 01:01:28,470 AUDIENCE: Communication [INAUDIBLE]. 757 01:01:28,470 --> 01:01:33,829 VIJAY GADEPALLY: So there are many different ways to parallelize it. One would be data 758 01:01:33,829 --> 01:01:40,270 parallelism, which is where I take my big data set or big data point, and I distribute that across 759 01:01:40,270 --> 01:01:44,359 my different nodes. And each one independently learns a model that works well. And then I 760 01:01:44,359 --> 01:01:47,750 do some synchronization across these different pieces. 761 01:01:47,750 --> 01:01:51,539 There are also techniques where you have-- the model itself may be too big to sit on 762 01:01:51,539 --> 01:01:55,700 a single node or a single processing element. And you might have to distribute that. So, 763 01:01:55,700 --> 01:02:02,609 yes, a lot of very interesting research going on in that area. And by default, when you 764 01:02:02,609 --> 01:02:07,809 do run things, they are running in parallel, just even on your GPU. They're using multiple 765 01:02:07,809 --> 01:02:12,529 cores at once. So there is some level of parallelism-- within the node itself-- that runs 766 01:02:12,529 --> 01:02:18,130 by default on most machine learning software. 767 01:02:18,130 --> 01:02:22,140 So inference-- as I mentioned, is just using the trained model again. And the power of 768 01:02:22,140 --> 01:02:25,950 neural networks really falls within their non-linearity. So you have that non-linear 769 01:02:25,950 --> 01:02:31,230 F function that you're applying over and over and over across your layers. And in this crudely 770 01:02:31,230 --> 01:02:38,509 drawn diagram on my iPad-- this is not clear at all-- 771 01:02:38,509 --> 01:02:44,538 you have Xs and Os, right? It reminds me of a song. And you have features over here. 772 01:02:44,538 --> 01:02:48,579 And you're trying to basically classify each point. Which is an X? And which is an O?
A linear 773 01:02:48,579 --> 01:02:54,019 classifier could do a pretty good job in this type of situation. And you could apply a neural 774 01:02:54,019 --> 01:02:57,099 network to this, but maybe it's overkill in that type of situation. 775 01:02:57,099 --> 01:03:02,339 But in some feature space, if this is how your Xs and Os are divided amongst each other 776 01:03:02,339 --> 01:03:05,910 and you're trying to come up with the right label, one thing I might suggest is to maybe 777 01:03:05,910 --> 01:03:09,411 find another feature space where you could get a better separation between the 778 01:03:09,411 --> 01:03:13,599 two. Or a technique like a neural network might do a very good job. Or any of these 779 01:03:13,599 --> 01:03:17,299 non-linear machine learning techniques might do a very good job of looking for these really 780 01:03:17,299 --> 01:03:20,980 complex decision boundaries that are out there. All right. 781 01:03:20,980 --> 01:03:26,999 So, coming back to the earlier question: when you're designing a neural network, what do you have to do? 782 01:03:26,999 --> 01:03:30,770 What are the different choices, et cetera? There is a lot going on here. So you have 783 01:03:30,770 --> 01:03:35,410 to pick the depth or number of layers, what the inputs are, the type 784 01:03:35,410 --> 01:03:39,910 of network that you're using, the types of layers, and the training algorithm and metrics 785 01:03:39,910 --> 01:03:42,839 that you're using to assess the performance of this neural network. 786 01:03:42,839 --> 01:03:46,910 The good thing, however, is it's so expensive to train a neural network that you largely 787 01:03:46,910 --> 01:03:50,920 are not making these decisions in many cases. You just pick up what somebody else has done, 788 01:03:50,920 --> 01:03:54,660 and you start from there. That might be-- I don't know if that's a good 789 01:03:54,660 --> 01:04:00,749 or a bad thing. But that's often the way in practice that people end up doing this. 790 01:04:00,749 --> 01:04:07,430 But there is some theory on the general approach. I think in this short amount of time, which 791 01:04:07,430 --> 01:04:11,740 I'm already over, we won't be able to get into it. But I'm happy to-- actually, these 792 01:04:11,740 --> 01:04:15,990 slides have backups on them. So when I share them with you, they do have a lot more detail 793 01:04:15,990 --> 01:04:18,730 on each of these different pieces. All right. 794 01:04:18,730 --> 01:04:22,720 Very quickly, we'll talk about unsupervised learning. And the basic idea is the task of 795 01:04:22,720 --> 01:04:27,641 describing hidden structure from unlabeled data. So in contrast to supervised learning, 796 01:04:27,641 --> 01:04:31,579 we are not providing labels. We're just giving the algorithm a data set and saying, tell 797 01:04:31,579 --> 01:04:34,330 me something cool that's going on over here. 798 01:04:34,330 --> 01:04:39,630 Now, clearly, you can't label the data if you do that. But what you can do is maybe 799 01:04:39,630 --> 01:04:46,440 look for clusters, or look for dimensions or pieces of the data that are unimportant or 800 01:04:46,440 --> 01:04:52,049 extraneous. So if we observe certain features, we would like to observe the patterns amongst 801 01:04:52,049 --> 01:04:53,349 these features. 802 01:04:53,349 --> 01:04:57,970 And the typical tasks that one would do in unsupervised learning are clustering and data 803 01:04:57,970 --> 01:05:02,869 projection, or data pre-processing, or dimensionality reduction.
And the goal is to discover interesting 804 01:05:02,869 --> 01:05:10,049 things about the data set, such as subgroups, patterns, clusters, et cetera. 805 01:05:10,049 --> 01:05:13,900 One of the difficulties-- in supervised learning, we know, right? We 806 01:05:13,900 --> 01:05:17,650 have an input. We have a label. And we're like, OK, if my algorithm 807 01:05:17,650 --> 01:05:24,259 doesn't give me the label, bad. Go retrain. I can go back and use that as 808 01:05:24,259 --> 01:05:25,829 my performance metric. 809 01:05:25,829 --> 01:05:30,279 In unsupervised learning, there is no simple goal, such as maximizing a certain probability, 810 01:05:30,279 --> 01:05:34,730 for the algorithm. Some of that is something that you have to work out. Is 811 01:05:34,730 --> 01:05:39,549 it the inter-cluster or intra-cluster distance-- how much separation I'm getting-- that is going 812 01:05:39,549 --> 01:05:43,260 to be my performance metric? Is it the number of clusters that I'm creating? Is that the 813 01:05:43,260 --> 01:05:46,069 metric that I'm using? 814 01:05:46,069 --> 01:05:50,690 But it is very popular, because it works on unlabeled data. And I'm sure many of us work 815 01:05:50,690 --> 01:05:55,359 on data sets which are just too large or too difficult to sit and label. An example 816 01:05:55,359 --> 01:06:00,640 that comes to my mind, certainly, is in the world of cybersecurity, where you're collecting 817 01:06:00,640 --> 01:06:05,140 billions and billions of network packets. And you're trying to look for anomalous behavior. 818 01:06:05,140 --> 01:06:09,400 You're not going to go through and look at each packet and be like, bad, good, what it 819 01:06:09,400 --> 01:06:13,779 is. But you might use an unsupervised technique to maybe extract out some of the relevant 820 01:06:13,779 --> 01:06:18,180 pieces, then go through the trouble of labeling that data, and then 821 01:06:18,180 --> 01:06:21,490 pass that on to a supervised learning technique. And I'm happy to share some research that 822 01:06:21,490 --> 01:06:23,740 we've been doing on that front. 823 01:06:23,740 --> 01:06:29,550 Some common techniques are within clustering and data projection. Clustering is the basic 824 01:06:29,550 --> 01:06:33,269 idea that we want to group objects or sets of features, such that objects in the same 825 01:06:33,269 --> 01:06:37,730 cluster are more similar to each other than to those in another cluster. And what you typically do for that 826 01:06:37,730 --> 01:06:45,549 is you put your data in some feature space, and you try to optimize some intra-cluster 827 01:06:45,549 --> 01:06:49,999 measure, which is basically saying, I want the points within my cluster to be closer 828 01:06:49,999 --> 01:06:52,339 than anything outside of my cluster, right? 829 01:06:52,339 --> 01:06:57,830 So that's a metric. And you iteratively adjust the membership of each point. You set a number 830 01:06:57,830 --> 01:07:02,109 of clusters, saying, I need five clusters. It'll randomly assign things. And it'll keep 831 01:07:02,109 --> 01:07:06,930 adjusting the membership of a particular data point within a cluster, based on a metric 832 01:07:06,930 --> 01:07:09,720 such as the squared error. 833 01:07:09,720 --> 01:07:13,809 So in this example, we might say that, OK, these are three clusters that I get out of 834 01:07:13,809 --> 01:07:19,499 it.
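As a concrete sketch of that procedure, here is k-means clustering in Python with scikit-learn. The synthetic two-dimensional data and the choice of three clusters are illustrative assumptions, not data from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three blobs in a 2-D feature space, with no labels provided.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(2, 3), scale=0.5, size=(50, 2)),
])

# Fix the number of clusters, then iteratively adjust memberships to
# minimize the within-cluster squared error, as described above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster membership for the first ten points
print(kmeans.inertia_)      # within-cluster sum of squared errors
```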
Dimensionality reduction is the idea of reducing the number of random variables under 835 01:07:19,499 --> 01:07:24,150 consideration. Very often, you'll collect a data set that has hundreds to thousands 836 01:07:24,150 --> 01:07:25,750 of different features. 837 01:07:25,750 --> 01:07:30,009 Maybe some of these features are not that important. Maybe they're unchanging. Or even 838 01:07:30,009 --> 01:07:35,940 if they are changing, it's not by much. And so maybe you want to remove them from consideration. 839 01:07:35,940 --> 01:07:41,479 That's when you use a technique like dimensionality reduction. And this is really, really important 840 01:07:41,479 --> 01:07:47,109 when you're doing feature selection and feature extraction in your real data sets. And you 841 01:07:47,109 --> 01:07:50,779 might also use it for other purposes, such as compression or visualization. 842 01:07:50,779 --> 01:07:55,999 So if you want to show things in Excel, showing a thousand-dimensional object may be difficult. 843 01:07:55,999 --> 01:08:02,369 You might try to project it down to the two or three dimensions that are easiest to visualize. 844 01:08:02,369 --> 01:08:08,109 And of course, you can use neural networks for unsupervised learning as well. Surprise, 845 01:08:08,109 --> 01:08:09,339 surprise. 846 01:08:09,339 --> 01:08:13,510 So as much as a lot of the press you've seen has been on things like image classification 847 01:08:13,510 --> 01:08:18,560 using nice labeled data sets, there's a lot of work where you can apply them in an unsupervised 848 01:08:18,560 --> 01:08:23,899 case. And these are largely used to find better representations for data, for tasks such as clustering 849 01:08:23,899 --> 01:08:27,689 and dimensionality reduction. And they're really powerful because of their non-linear 850 01:08:27,689 --> 01:08:28,689 capabilities. 851 01:08:28,689 --> 01:08:33,618 So one example-- I won't spend way too much time on this-- is an autoencoder. And the 852 01:08:33,618 --> 01:08:37,421 basic idea behind an autoencoder is you're trying to find some compressed representation 853 01:08:37,421 --> 01:08:43,849 for data. And the way we do this is by changing the metric that we use to say that the system 854 01:08:43,849 --> 01:08:45,658 has done a good job. 855 01:08:45,658 --> 01:08:49,488 And the metric is basically-- if I have a set of input features that I'm passing in, 856 01:08:49,488 --> 01:08:55,529 I would like to do the best job of reconstructing that input at my output. And what I do is 857 01:08:55,529 --> 01:09:00,421 I squeeze it through a smaller number of layers, which forms this compressed representation 858 01:09:00,421 --> 01:09:02,339 of my data set. 859 01:09:02,339 --> 01:09:08,380 And so the idea here is, how can I pass my inputs through this narrow waist to come up 860 01:09:08,380 --> 01:09:13,068 with a reconstructed input that's very similar to my original input? And so my metric in 861 01:09:13,068 --> 01:09:19,679 this particular case is essentially the difference between the reconstructed input, or the output, 862 01:09:19,679 --> 01:09:27,960 and the input. And the compressed representation you can think of as the reduced-dimensionality 863 01:09:27,960 --> 01:09:31,920 version of my problem. 864 01:09:31,920 --> 01:09:37,219 We've also done some work on replicator networks, which are also really, really cool. Happy 865 01:09:37,219 --> 01:09:41,929 to chat about that as well.
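Here is a minimal sketch of that autoencoder idea, written in PyTorch as an assumed framework choice; the layer sizes and synthetic data are illustrative. Note that the loss compares the output to the input itself rather than to a label, and the training loop is the same iterative weight adjustment described earlier:

```python
import torch
import torch.nn as nn

n_features, n_bottleneck = 64, 8  # illustrative sizes

# Encoder squeezes the input through a narrow waist; decoder reconstructs it.
model = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(),
    nn.Linear(32, n_bottleneck), nn.ReLU(),  # compressed representation
    nn.Linear(n_bottleneck, 32), nn.ReLU(),
    nn.Linear(32, n_features),               # reconstructed input
)

loss_fn = nn.MSELoss()  # metric: difference between reconstruction and input
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(256, n_features)  # synthetic unlabeled data
for step in range(100):           # iteratively adjust weights to reduce error
    opt.zero_grad()
    loss = loss_fn(model(X), X)   # reconstruct the input at the output
    loss.backward()
    opt.step()

codes = model[:4](X)  # first four modules = encoder; the 8-D representation
print(codes.shape)    # torch.Size([256, 8])
```

The bottleneck activations can then be treated as a non-linear, reduced-dimensionality version of the data, in the same spirit as the dimensionality reduction techniques above.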
And finally, we have to talk very briefly on reinforcement 866 01:09:41,929 --> 01:09:48,649 learning. And the basic-- again, at a very high level, the reason reinforcement learning 867 01:09:48,649 --> 01:09:53,219 is fundamentally different than supervised or unsupervised learning is that you're not 868 01:09:53,219 --> 01:09:57,550 passing in a label associated with an input feature. 869 01:09:57,550 --> 01:10:02,610 So there is no supervisor or a person that can label it, but just a reward signal. And 870 01:10:02,610 --> 01:10:08,460 the feedback is often delayed. And time is important, so it steps through a process. 871 01:10:08,460 --> 01:10:15,050 And the agent's actions often change the input data that it receives. So just to maybe-- 872 01:10:15,050 --> 01:10:19,260 in the interest of time, just to give you examples of where reinforcement learning could 873 01:10:19,260 --> 01:10:22,760 work and why you would use a technique like this. 874 01:10:22,760 --> 01:10:29,599 So, flying stunt maneuvers in a helicopter. So if your helicopter is straight, you say 875 01:10:29,599 --> 01:10:33,540 keep doing more of whatever you're doing to keep it there. If the helicopter tips over, 876 01:10:33,540 --> 01:10:38,530 you say stop doing whatever you just did to do that. Could you create a supervised learning 877 01:10:38,530 --> 01:10:41,710 algorithm for doing this? Sure. Right? 878 01:10:41,710 --> 01:10:47,030 You would basically look for all the configurations of your entire system every time the helicopter 879 01:10:47,030 --> 01:10:51,330 was upright. And you would look for all the examples where your helicopter was tipping 880 01:10:51,330 --> 01:10:57,510 over or falling. And you would basically say, OK, my engine speed was this much. My rotor 881 01:10:57,510 --> 01:10:58,860 speed was this much. 882 01:10:58,860 --> 01:11:04,980 And there are probably people here who fly helicopters, so pardon me if I am completely 883 01:11:04,980 --> 01:11:10,860 oversimplifying this problem here. However, you could certainly label it that way and 884 01:11:10,860 --> 01:11:16,210 say all these configurations of the helicopter meant the helicopter was upright. All these 885 01:11:16,210 --> 01:11:19,630 configurations of the helicopter meant the helicopter was not upright. 886 01:11:19,630 --> 01:11:24,110 That would be pretty expensive and difficult data collection to do. Not sure how many people 887 01:11:24,110 --> 01:11:30,040 want to volunteer for-- let's do all the ones that are at fault. And lots of other applications 888 01:11:30,040 --> 01:11:36,260 beyond that. So these are really useful, especially in cases where what you're trying to model 889 01:11:36,260 --> 01:11:41,970 is just extremely complex. And the other really powerful thing is this tends to mimic human 890 01:11:41,970 --> 01:11:47,070 behavior. And so they're very useful in those types of applications. 891 01:11:47,070 --> 01:11:50,480 AUDIENCE: Can you explain, shortly, what a reward would look like? 892 01:11:50,480 --> 01:11:55,660 VIJAY GADEPALLY: So a reward would just be-- it's very similar to when you get points. 893 01:11:55,660 --> 01:11:59,080 So you have your algorithm that's basically trying to maximize the number of points that 894 01:11:59,080 --> 01:12:05,270 it receives, for example. And as you do-- it's very similar to what you or I would consider 895 01:12:05,270 --> 01:12:07,690 a reward playing a video game, right?
896 01:12:07,690 --> 01:12:14,380 Every time I get points, I do more of the activities that make me get points. And it's 897 01:12:14,380 --> 01:12:23,409 essentially the same concept over here. All right. So with that, I will conclude, only 898 01:12:23,409 --> 01:12:30,349 20 minutes behind schedule. So I guess the long story short is there's lots of exciting 899 01:12:30,349 --> 01:12:33,889 research into AI and machine learning techniques out there. 900 01:12:33,889 --> 01:12:39,969 We did a one-hour view of this broad field that researchers have dedicated about six to seven 901 01:12:39,969 --> 01:12:46,170 decades of work towards, so my apologies to anyone watching this or in the room whose 902 01:12:46,170 --> 01:12:52,280 work I just jumped over. The key ingredients, however-- and I think this is most important 903 01:12:52,280 --> 01:12:58,650 to this group-- come from looking at the problems where AI has done really well. 904 01:12:58,650 --> 01:13:03,480 These are some of the key ingredients-- data availability, computing infrastructure, and 905 01:13:03,480 --> 01:13:06,659 the domain expertise and algorithms. And I think it's very exciting to see this group 906 01:13:06,659 --> 01:13:11,909 over here, because we do have all of these pieces coming together. So great things are 907 01:13:11,909 --> 01:13:13,659 bound to happen. 908 01:13:13,659 --> 01:13:18,840 There are, I think, large challenges in data availability and readiness for AI, which is 909 01:13:18,840 --> 01:13:24,390 something we're just going to scratch the surface of during this class. And some of the computing 910 01:13:24,390 --> 01:13:28,389 infrastructure is something that we'll be talking to you about in a couple of minutes. 911 01:13:28,389 --> 01:13:33,309 And if you're interested in a more detailed look at any of these things, a number 912 01:13:33,309 --> 01:13:39,150 of us actually wrote-- maybe I'm biased. I think it's a great, great, great write-up. 913 01:13:39,150 --> 01:13:45,909 But, no, I think it's useful. It has its places. Obviously, there's a lot of material in here. But 914 01:13:45,909 --> 01:13:51,099 we tried to do our best job to at least cite some of this really, really interesting work 915 01:13:51,099 --> 01:13:56,300 that's going on in the field. So with that, I'll pause for any additional questions, but 916 01:13:56,300 --> 01:13:57,300 thank you very much for your attention.