Description: In this lecture, Prof. Winston discusses modern breakthroughs in neural net research.
Instructor: Patrick H. Winston
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PATRICK H. WINSTON: Well, what we're going to do today is climb a pretty big mountain because we're going to go from a neural net with two parameters to discussing the kind of neural nets in which people end up dealing with 60 million parameters. So it's going to be a pretty big jump.
Along the way are a couple things I wanted to underscore from our previous discussion. Last time, I tried to develop some intuition for the kinds of formulas that you use to actually do the calculations in a small neural net about how the weights are going to change. And the main thing I tried to emphasize is that when you have a neural net like this one, everything is sort of divided in each column.
You can't have the performance based on this output affect some weight change back here without going through this finite number of output variables, the y1s. And by the way, there's no y2 and y4-- there's no y2 and y3. Dealing with this is really a notational nightmare, and I spent a lot of time yesterday trying to clean it up a little bit.
But basically, what I'm trying to say has nothing to do with the notation I have used but rather with the fact that there's a limited number of ways in which that can influence this, even though the number of paths through this network can be growing exponential. So those equations underneath are equations that derive from trying to figure out how the output performance depends on some of these weights back here.
And what I've calculated is I've calculated the dependence of the performance on w1 going that way, and I've also calculated the dependence of performance on w1 going that way. So that's one of the equations I've got down there. And another one deals with w3, and it involves going both this way and this way. And all I've done in both cases, in all four cases, is just take the partial derivative of performance with respect to those weights and use the chain rule to expand it.
And when I do that, this is the stuff I get. And that's just a whole bunch of partial derivatives. But if you look at it and let it sing a little bit to you, what you see is that there's a lot of redundancy in the computation. So for example, this guy here, partial of performance with respect to w1, depends on both paths, of course.
But look at the first elements here, these guys right here. And look at the first elements in the expression for calculating the partial derivative of performance with respect to w3, these guys. They're the same.
And not only that, if you look inside these expressions and look at this particular piece here, you see that that is an expression that was needed in order to calculate one of the downstream weights, the changes in one of the downstream weights. But it happens to be the same thing as you see over here. And likewise, this piece is the same thing you see over here.
So each time you move further and further back from the outputs toward the inputs, you're reusing a lot of computation that you've already done. So I'm trying to find a way to sloganize this, and what I've come up with is what's done is done and cannot be-- no, no. That's not quite right, is it? It's what's computed is computed and need not be recomputed. OK? So that's what's going on here. And that's why this is a calculation that's linear in the depths of the neural net, not exponential.
There's another thing I wanted to point out in connection with these neural nets. And that has to do with what happens when we look at a single neuron and note that what we've got is we've got a bunch of weights that you multiply times a bunch of inputs like so. And then those are all summed up in a summing box before they enter some kind of non-linearity, in our case a sigmoid function.
But if I ask you to write down the expression for the value we've got there, what is it? Well, it's just the sum of the w's times the x's. What's that? That's the dot product. Remember a few lectures ago I said that some of us believe that the dot product is a fundamental calculation that takes place in our heads? So this is why we think so. If neural nets are doing anything like this, then there's a dot product between some weights and some input values.
Now, it's a funny kind of dot product because in the models that we've been using, these input variables are all or none, or 0 or 1. But that's OK. I have it on good authority that there are neurons in our head for which the values that are produced are not exactly all or none but rather have a kind of proportionality to them. So you get a real dot product type of operation out of that.
So that's by way of a couple of asides that I wanted to underscore before we get into the center of today's discussion, which will be to talk about the so-called deep nets. Now, let's see, what's a deep net do? Well, from last time, you know that a deep net does that sort of thing, and it's interesting to look at some of the offerings here.
By the way, how good was this performance in 2012? Well, it turned out that the fraction of the time that the system had the right answer in its top five choices was about 15%. And the fraction of the time that it got exactly the right answer as its top pick was about 37%-- error, 15% error if you count it as an error if it's-- what am I saying?
You got it right if you got it in the top five. An error rate on that calculation, about 15%. If you say you only get it right if it was your top choice, then the error rate was about 37%. So pretty good, especially since some of these things are highly ambiguous even to us.
And what kind of a system did that? Well, it wasn't one that looked exactly like that, although that is the essence of it. The system actually looked like that. There's quite a lot of stuff in there. And what I'm going to talk about is not exactly this system, but I'm going to talk about the stuff of which such systems are made because there's nothing particularly special about this. It just happens to be a particular assembly of components that tend to reappear when anyone does this sort of neural net stuff.
So let me explain that this way. First thing I need to talk about is the concept of-- well, I don't like the term. It's called convolution. I don't like the term because in the second-best course at the Institute, Signals and Systems, you learn about impulse responses and convolution integrals and stuff like that. And this hints at that, but it's not the same thing because there's no memory involved in what's going on as these signals are processed. But they call it convolutional neural nets anyway.
So here you are. You got some kind of image. And even with lots of computing power and GPUs and all that sort of stuff, we're not talking about images with 4 million pixels. We're talking about images that might be 256 on a side. As I say, we're not talking about images that are 1,000 by 1,000 or 4,000 by 4,000 or anything like that. They tend to be kind of compressed into a 256-by-256 image.
And now what we do is we run over this with a neuron that is looking only at a 10-by-10 square like so, and that produces an output. And next, we went over that again having shifted this neuron a little bit like so. And then the next thing we do is we shift it again, so we get that output right there. So each of those deployments of a neuron produces an output, and that output is associated with a particular place in the image. This is the process that is called convolution as a term of art.
Now, this guy, or this convolution operation, results in a bunch of points over here. And the next thing that we do with those points is we look in local neighborhoods and see what the maximum value is. And then we take that maximum value and construct yet another mapping of the image over here using that maximum value.
Then we slide that over like so, and we produce another value. And then we slide that over one more time with a different color, and now we've got yet another value. So this process is called pooling. And because we're taking the maximum, this particular kind of pooling is called max pooling.
So now let's see what's next. This is taking a particular neuron and running it across the image. We call that a kernel, again sucking some terminology out of Signals and Systems. But now what we're going to do is we're going to say we could use a whole bunch of kernels. So the thing that I produce with one kernel can now be repeated many times like so. In fact, a typical number is 100 times.
So now what we've got is we've got a 256-by-256 image. We've gone over it with a 10-by-10 kernel. We have taken the maximum values that are in the vicinity of each other, and then we repeated that 100 times. So now we can take that, and we can feed all those results into some kind of neural net.
And then we can, through perhaps a fully-connected job on the final layers of this, and then in the ultimate output we get some sort of indication of how likely it is that the thing that's being seen is, say, a mite. So that's roughly how these things work.
So what have we talked about so far? We've talked about pooling, and we've talked about convolution. And now we can talk about some of the good stuff. But before I get into that, this is what we can do now, and you can compare this with what was done in the old days. It was done in the old days before massive amounts of computing became available is a kind of neural net activity that's a little easier to see.
You might, in the old days, only have enough computing power to deal with a small grid of picture elements, or so-called pixels. And then each of these might be a value that is fed as an input into some kind of neuron. And so you might have a column of neurons that are looking at these pixels in your image.
And then there might be a small number of columns that follow from that. And finally, something that says this neuron is looking for things that are a number 1, that is to say, something that looks like a number 1 in the image. So this stuff up here is what you can do when you have a massive amount of computation relative to the kind of thing you used to see in the old days.
So what's different? Well, what's different is instead of a few hundred parameters, we've got a lot more. Instead of 10 digits, we have 1,000 classes. Instead of a few hundred samples, we have maybe 1,000 examples of each class. So that makes a million samples. And we got 60 million parameters to play with.
And the surprising thing is that the net result is we've got a function approximator that astonishes everybody. And no one quite knows why it works, except that when you throw an immense amount of computation into this kind of arrangement, it's possible to get a performance that no one expected would be possible. So that's sort of the bottom line.
But now there are a couple of ideas beyond that that I think are especially interesting, and I want to talk about those. First idea that's especially interesting is the idea of autocoding, and here's how the idea of autocoding works. I'm going to run out of board space, so I think I'll do it right here.
You have some input values. They go into a layer of neurons, the input layer. Then there is a so-called hidden layer that's much smaller. So maybe in the example, there will be 10 neurons here and just a couple here. And then these expand to an output layer like so.
Now we can take the output layer, z1 through zn, and compare it with the desired values, d1 through dn. You following me so far? Now, the trick is to say, well, what are the desired values? Let's let the desired values be the input values. So what we're going to do is we're going to train this net up so that the output's the same as the input.
What's the good of that? Well, we're going to force it down through this [? neck-down ?] piece of network. So if this network is going to succeed in taking all the possibilities here and cramming them into this smaller inner layer, the so-called hidden layer, such that it can reproduce the input [? at ?] the output, it must be doing some kind of generalization of the kinds of things it sees on its input. And that's a very clever idea, and it's seen in various forms in a large fraction of the papers that appear on deep neural nets.
But now I want to talk about an example so I can show you a demonstration. OK? So we don't have GPUs, and we don't have three days to do this. So I'm going to make up a very simple example that's reminiscent of what goes on here but involves hardly any computation.
What I'm going to imagine is we're trying to recognize animals from how tall they are from the shadows that they cast. So we're going to recognize three animals, a cheetah, a zebra, and a giraffe, and they will each cast a shadow on the blackboard like me. No vampire involved here. And what we're going to do is we're going to use the shadow as an input to a neural net. All right? So let's see how that would work.
So there is our network. And if I just clicked into one of these test samples, that's the height of the shadow that a cheetah casts on a wall. And there are 10 input neurons corresponding to each level of the shadow. They're rammed through three inner layer neurons, and from that it spreads out and becomes the outer layer values. And we're going to compare those outer layer values to the desired values, but the desired values are the same as the input values.
So this column is a column of input values. On the far right, we have our column of desired values. And we haven't trained this neural net yet. All we've got is random values in there. So if we run the test samples through, we get that and that. Yeah, cheetahs are short, zebras are medium height, and giraffes are tall. But our output is just pretty much 0.5 for all of them, for all of those shadow heights, all right, [? with ?] no training so far.
So let's run this thing. We're just using simple [? backdrop, ?] just like on our world's simplest neural net. And it's interesting to see what happens. You see all those values changing? Now, I need to mention that when you see a green connection, that means it's a positive weight, and the density of the green indicates how positive it is. And the red ones are negative weights, and the intensity of the red indicates how red it is.
So here you can see that we still have from our random inputs a variety of red and green values. We haven't really done much training, so everything correctly looks pretty much random. So let's run this thing. And after only 1,000 iterations going through these examples and trying to make the output the same as the input, we reached a point where the error rate has dropped. In fact, it's dropped so much it's interesting to relook at the test cases.
So here's a test case where we have a cheetah. And now the output value is, in fact, very close to the desired value in all the output neurons. So if we look at another one, once again, there's a correspondence in the right two columns. And if we look at the final one, yeah, there's a correspondence in the right two columns.
Now, you back up from this and say, well, what's going on here? It turns out that you're not training this thing to classify animals. You're training it to understand the nature of the things that it sees in the environment because all it sees is the height of a shadow. It doesn't know anything about the classifications you're going to try to get out of that. All it sees is that there's a kind of consistency in the kind of data that it sees on the input values. Right?
Now, you might say, OK, oh, that's cool, because what must be happening is that that hidden layer, because everything is forced through that narrow pipe, must be doing some kind of generalization. So it ought to be the case that if we click on each of those neurons, we ought to see it specialize to a particular height, because that's the sort of stuff that's presented on the input.
Well, let's go see what, in fact, is the maximum stimulation to be seen on the neurons in that hidden layer. So when I click on these guys, what we're going to see is the input values that maximally stimulate that neuron. And by the way, I have no idea how this is going to turn out because the initialization's all random.
Well, that's good. That one looks like it's generalized the notion of short. Ugh, that doesn't look like medium. And in fact, the maximum stimulation doesn't involve any stimulation from that lower neuron. Here, look at this one. That doesn't look like tall. So we got one that looks like short and two that just look completely random.
So in fact, maybe we better back off the idea that what's going on in that hidden layer is generalization and say that what is going on in there is maybe the encoding of a generalization. It doesn't look like an encoding we can see, but there is a generalization that's-- let me start that over.
We don't see the generalization in the stimulating values. What we have instead is we have some kind of encoded generalization. And because we got this stuff encoded, it's what makes these neural nets so extraordinarily difficult to understand.
We don't understand what they're doing. We don't understand why they can recognize a cheetah. We don't understand why it can recognize a school bus in some cases, but not in others, because we don't really understand what these neurons are responding to.
Well, that's not quite true. There's been a lot of work recently on trying to sort that out, but it's still a lot of mystery in this world. In any event, that's the autocoding idea. It comes in various guises. Sometimes people talk about Boltzmann machines and things of that sort. But it's basically all the same sort of idea.
And so what you can do is layer by layer. Once you've trained the input layer, then you can use that layer to train the next layer, and then that can train the next layer after that. And it's only at the very, very end that you say to yourself, well, now I've accumulated a lot of knowledge about the environment and what can be seen in the environment. Maybe it's time to get around to using some samples of particular classes and train on classes. So that's the story on autocoding.
Now, the next thing to talk about is that final layer. So let's see what the final layer might look like. Let's see, it might look like this. There's a [? summer. ?] There's a minus 1 up here. No. Let's see, there's a minus 1 up-- [INAUDIBLE]. There's a minus 1 up there. There's a multiplier here. And there's a threshold value there.
Now, likewise, there's some other input values here. Let me call this one x, and it gets multiplied by some weight. And then that goes into the [? summer ?] as well. And that, in turn, goes into a sigmoid that looks like so. And finally, you get an output, which we'll z.
So it's clear that if you just write out the value of z as it depends on those inputs using the formula that we worked with last time, then what you see is that z is equal to 1 over 1 plus e to the minus w times x minus T-- plus T, I guess. Right? So that's a sigmoid function that depends on the value of that weight and on the value of that threshold.
So let's look at how those values might change things. So here we have an ordinary sigmoid. And what happens if we shift it with a threshold value? If we change that threshold value, then it's going to shift the place where that sigmoid comes down. So a change in T could cause this thing to shift over that way. And if we change the value of w, that could change how steep this guy is.
So we might think that the performance, since it depends on w and T, should be adjusted in such a way as to make the classification do the right thing. But what's the right thing? Well, that depends on the samples that we've seen. Suppose, for example, that this is our sigmoid function. And we see some examples of a class, some positive examples of a class, that have values that lie at that point and that point and that point.
And we have some values that correspond to situations where the class is not one of the things that are associated with this neuron. And in that case, what we see is examples that are over in this vicinity here. So the probability that we would see this particular guy in this world is associated with the value on the sigmoid curve.
So you could think of this as the probability of that positive example, and this is the probability of that positive example, and this is the probability of that positive example. What's the probability of this negative example? Well, it's 1 minus the value on that curve. And this one's 1 minus the value on that curve.
So we could go through the calculations. And what we would determine is that to maximize the probability of seeing this data, this particular stuff in a set of experiments, to maximize that probability, we would have to adjust T and w so as to get this curve doing the optimal thing. And there's nothing mysterious about it. It's just more partial derivatives and that sort of thing.
But the bottom line is that the probability of seeing this data is dependent on the shape of this curve, and the shape of this curve is dependent on those parameters. And if we wanted to maximize the probability that we've seen this data, then we have to adjust those parameters accordingly.
Let's have a look at a demonstration. OK. So there's an ordinary sigmoid curve. Here are a couple of positive examples. Here's a negative example. Let's put in some more positive examples over here. And now let's run the good, old gradient ascent algorithm on that. And this is what happens. You've seen how the probability, as we adjust the shape of the curve, the probability of seeing those examples of the class goes up, and the probability of seeing the non-example goes down.
So what if we put some more examples in? If we put a negative example there, not much is going to happen. What would happen if we put a positive example right there? Then we're going to start seeing some dramatic shifts in the shape of the curve. So that's probably a noise point. But we can put some more negative examples in there and see how that adjusts the curve.
All right. So that's what we're doing. We're viewing this output value as something that's related to the probability of seeing a class. And we're adjusting the parameters on that output layer so as to maximize the probability of the sample data that we've got at hand. Right?
Now, there's one more thing. Because see what we've got here is we've got the basic idea of back propagation, which has layers and layers of additional-- let me be flattering and call them ideas layered on top. So here's the next idea that's layered on top.
So we've got an output value here. And it's a function after all, and it's got a value. And if we have 1,000 classes, we're going to have 1,000 output neurons, and each is going to be producing some kind of value. And we can think of that value as a probability.
But I didn't want to write a probability yet. I just want to say that what we've got for this output neuron is a function of class 1. And then there will be another output neuron, which is a function of class 2. And these values will be presumably higher-- this will be higher if we are, in fact, looking at class 1. And this one down here will be, in fact, higher if we're looking at class m.
So what we would like to do is we'd like to not just pick one of these outputs and say, well, you've got the highest value, so you win. What we want to do instead is we want to associate some kind of probability with each of the classes. Because, after all, we want to do things like find the most probable five.
So what we do is we say, all right, so the actual probability of class 1 is equal to the output of that sigmoid function divided by the sum over all functions. So that takes all of that entire output vector and converts each output value into a probability.
So when we used that sigmoid function, we did it with the view toward thinking about that as a probability. And in fact, we assumed it was a probability when we made this argument. But in the end, there's an output for each of those classes. And so what we get is, in the end, not exactly a probability until we divide by a normalizing factor.
So this, by the way, is called-- not on my list of things, but it soon will be. Since we're not talking about taking the maximum and using that to classify the picture, what we're going to do is we're going to use what's called softmax. So we're going to give a range of classifications, and we're going to associate a probability with each. And that's what you saw in all of those samples. You saw, yes, this is [? containership, ?] but maybe it's also this, that, or a third, or fourth, and fifth thing.
So that is a pretty good summary of the kinds of things that are involved. But now we've got one more step, because what we can do now is we can take this output layer idea, this softmax idea, and we can put them together with the autocoding idea. So we've trained just a layer up. And now we're going to detach it from the output layer but retain those weights that connect the input to the hidden layer.
And when we do that, what we're going to see is something that looks like this. And now we've got a trained first layer but an untrained output layer. We're going to freeze the input layer and train the output layer using the sigmoid curve and see what happens when we do that.
Oh, by the way, let's run our test samples through. You can see it's not doing anything, and the output is half for each of the categories even though we've got a trained middle layer. So we have to train the outer layer. Let's see how long it takes. Whoa, that was pretty fast. Now there's an extraordinarily good match between the outputs and the desired outputs. So that's the combination of the autocoding idea and the softmax idea.
[? There's ?] just one more idea that's worthy of mention, and that's the idea of dropout. The plague of any neural net is that it gets stuck in some kind of local maximum. So it was discovered that these things train better if, on every iteration, you flip a coin for each neuron.
And if the coin ends up tails, you assume it's just died and has no influence on the output. It's called dropping out those neurons. And in our next iteration, you drop out a different set. So what this seems to do is it seems to prevent this thing from going into a frozen local maximum state. So that's deep nets. They should be called, by the way, wide nets because they tend to be enormously wide but rarely more than 10 columns deep.
Now, let's see, where to go from here? Maybe what we should do is talk about the awesome curiosity in the current state of the art. And that is that all of [? this ?] sophistication with output layers that are probabilities and training using autocoding or Boltzmann machines, it doesn't seem to help much relative to plain, old back propagation. So back propagation with a convolutional net seems to do just about as good as anything.
And while we're on the subject of an ordinary deep net, I'd like to examine a situation here where we have a deep net-- well, it's a classroom deep net. And we'll will put five layers in there, and its job is still to do the same thing. It's to classify an animal as a cheetah, a zebra, or a giraffe based on the height of the shadow it casts. And as before, if it's green, that means positive. If it's red, that means negative.
And right at the moment, we have no training. So if we run our test samples through, the output is always a 1/2 no matter what the animal is. All right? So what we're going to do is just going to use ordinary back prop on this, same thing as in that sample that's underneath the blackboard. Only now we've got a lot more parameters. We've got five columns, and each one of them has 9 or 10 neurons in it.
So let's let this one run. Now, look at that stuff on the right. It's all turned red. At first I thought this was a bug in my program. But that makes absolute sense. If you don't know what the actual animal is going to be and there are a whole bunch of possibilities, you better just say no for everybody. It's like when a biologist says, we don't know. It's the most probable answer.
Well, but eventually, after about 160,000 iterations, it seems to have got it. Let's run the test samples through. Now it's doing great. Let's do it again just to see if this is a fluke. And all red on the right side, and finally, you start seeing some changes go in the final layers there. And if you look at the error rate down at the bottom, you'll see that it kind of falls off a cliff. So nothing happens for a real long time, and then it falls off a cliff.
Now, what would happen if this neural net were not quite so wide? Good question. But before we get to that question, what I'm going to do is I'm going to do a funny kind of variation on the theme of dropout. What I'm going to do is I'm going to kill off one neuron in each column, and then see if I can retrain the network to do the right thing.
So I'm going to reassign those to some other purpose. So now there's one fewer neuron in the network. If we rerun that, we see that it trains itself up very fast. So we seem to be still close enough to a solution we can do without one of the neurons in each column.
Let's do it again. Now it goes up a little bit, but it quickly falls down to a solution. Try again. Quickly falls down to a solution. Oh, my god, how much of this am I going to do? Each time I knock something out and retrain, it finds its solution very fast.
Whoa, I got it all the way down to two neurons in each column, and it still has a solution. It's interesting, don't you think? But let's repeat the experiment, but this time we're going to do it a little differently. We're going to take our five layers, and before we do any training I'm going to knock out all but two neurons in each column. Now, I know that with two neurons in each column, I've got a solution. I just showed it. I just showed one.
But let's run it this way. It looks like increasingly bad news. What's happened is that this sucker's got itself into a local maximum. So now you can see why there's been a breakthrough in this neural net learning stuff. And it's because when you widen the net, you turn local maxima into saddle points. So now it's got a way of crawling its way through this vast space without getting stuck on a local maximum, as suggested by this.
All right. So those are some, I think, interesting things to look at by way of these demonstrations. But now I'd like to go back to my slide set and show you some examples that will address the question of whether these things are seeing like we see. So you can try these examples online. There are a variety of websites that allow you to put in your own picture.
And there's a cottage industry of producing papers in journals that fool neural nets. So in this case, a very small number of pixels have been changed. You don't see the difference, but it's enough to take this particular neural net from a high confidence that it's looking at a school bus to thinking that it's not a school bus. Those are some things that it thinks are a school bus.
So it appears to be the case that what is triggering this school bus result is that it's seeing enough local evidence that this is not one of the other 999 classes and enough positive evidence from these local looks to conclude that it's a school bus. So do you see any of those things? I don't. And here you can say, OK, well, look at that baseball one. Yeah, that looks like it's got a little bit of baseball texture in it. So maybe what it's doing is looking at texture.
These are some examples from a recent and very famous paper by Google using essentially the same ideas to put captions on pictures. So this, by the way, is what has stimulated all this enormous concern about artificial intelligence. Because a naive viewer looks at that picture and says, oh, my god, this thing knows what it's like to play, or be young, or move, or what a Frisbee is. And of course, it knows none of that. It just knows how to label this picture.
And to the credit of the people who wrote this paper, they show examples that don't do so well. So yeah, it's a cat, but it's not lying. Oh, it's a little girl, but she's not blowing bubbles. What about this one?
So we've been doing our own work in my laboratory on some of this. And the way the following set of pictures was produced was this. You take an image, and you separate it into a bunch of slices, each representing a particular frequency band. And then you go into one of those frequency bands and you knock out a rectangle from the picture, and then you reassemble the thing. And if you hadn't knocked that piece out, when you reassemble it, it would look exactly like it did when you started.
So what we're doing is we knock out as much as we can and still retain the neural net's impression that it's the thing that it started out thinking it was. So what do you think this is? It's identified by a neural net as a railroad car because this is the image that it started with. How about this one? That's easy, right? That's a guitar. We weren't able to mutilate that one very much and still retain the guitar-ness of it.
How about this one?
AUDIENCE: A lamp?
PATRICK H. WINSTON: What's that?
PATRICK H. WINSTON: What?
PATRICK H. WINSTON: A lamp. Any other ideas?
PATRICK H. WINSTON: Ken, what do you think it is?
AUDIENCE: A toilet.
PATRICK H. WINSTON: See, he's an expert on this subject.
It was identified as a barbell. What's that?
PATRICK H. WINSTON: A what?
PATRICK H. WINSTON: Cello. You didn't see the little girl or the instructor. How about this one?
PATRICK H. WINSTON: What?
PATRICK H. WINSTON: No.
PATRICK H. WINSTON: It's a grasshopper. What's this?
AUDIENCE: A wolf.
PATRICK H. WINSTON: Wow, you're good. It's actually not a two-headed wolf.
It's two wolves that are close together.
PATRICK H. WINSTON: That's a bird, right?
PATRICK H. WINSTON: Good for you. It's a rabbit.
How about that?
[? AUDIENCE: Giraffe. ?]
PATRICK H. WINSTON: Russian wolfhound.
PATRICK H. WINSTON: If you've been to Venice, you recognize this.
PATRICK H. WINSTON: So bottom line is that these things are an engineering marvel and do great things, but they don't see like we see.