Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems.
Full gradient descent uses all data in each step.
Stochastic method uses a minibatch of data (often 1 sample!).
Each step is much faster and the descent starts well.
Later the points bounce around / time to stop!
This method is the favorite for weights in deep learning.
Related section in textbook: VI.5
Instructor: Prof. Suvrit Sra
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR STRANG: [INAUDIBLE] Professor Suvrit Sra from EECS who taught 6.036 and the graduate version. And maybe some of you had him in one or other of those classes. So he graciously agreed to come today and to talk about Stochastic Gradient Descent, SGD. And it's terrific. Yeah, yeah.
So we're not quite at 1:05, but close. If everything is ready, then we're off. OK. Good.
PROFESSOR SRA: And your cutoff is like 1:55?
PROFESSOR STRANG: Yeah.
PROFESSOR SRA: OK.
PROFESSOR STRANG: But this is not a sharp cutoff.
PROFESSOR SRA: Why is there [INAUDIBLE] fluctuation?
PROFESSOR STRANG: There you go.
PROFESSOR SRA: Somebody changed their resolution it seems, but that's fine. It doesn't bother us. So I'm going to tell you about, let's say, one of the most ancient optimization methods, much simpler than, in fact, the more advanced methods you have already seen in class. And interestingly, this more ancient method remains "the" method for training large scale machine learning systems.
So there's a little bit of history around that. I'm not going to go too much into the history. But the bottom line, which probably Gil has also mentioned to you in class, is that, at least for large data science problems, in the end, stuff reduces to solving an optimization problem. And in current times these optimization problems are pretty large. So people actually started liking stuff like gradient descent, which was invented by Cauchy back in the day.
And this is how I'm writing the abstract problem. And what I want to see is-- OK, is it fitting on the page? This is my implementation in MATLAB of gradient descent, just to set the stage that this stuff really looks simple. You've already seen gradient descent. And today, essentially, in a nutshell, what really changes in this implementation of gradient descent is this part. That's it.
So you've seen gradient descent. I'm only going to change this one line. And the change of that one line, surprisingly, is driving all the deep learning tool boxes and all of large scale machine learning, et cetera. This is an oversimplification, but morally, that's it. So let's look at what's happening.
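To set the stage concretely, here is a minimal Python sketch of that gradient descent loop (the lecture's version is in MATLAB; the function names and toy objective here are my own illustration), with the one line that SGD will later change marked:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - step * grad f(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad_f(x)   # <-- SGD will replace only this gradient
    return x

# Toy objective f(x) = ||x - c||^2, with gradient 2(x - c); minimum at c.
c = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: 2 * (x - c), np.zeros(2))
```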
So I will become very concrete pretty soon. But abstractly, what I want you to look at is the kinds of optimization problems we are solving in machine learning. And I'll give you very concrete examples of these optimization problems so that you can relate to them better. But I'm just writing this as the key setup, that all the optimization problems that I'm going to talk about today, they look like that. You're trying to find an x that minimizes a cost function, where the cost function can be written as a sum.
In modern day machine learning parlance these are also called finite sum problems, in case you run into that term. And they just call it finite because n is finite here. In pure optimization theory parlance, n can actually go to infinity. And then they're called stochastic optimization problems-- just for terminology, if while searching the internet you run into some such terminology so you kind of know what it means.
So here is our setup in machine learning. We have a bunch of training data. On this slide, I'm calling x1 through xn. These are the training data, the raw features. Later, actually, I'll stop writing x for them and write them with the letter a. But hopefully, that's OK.
So x1 through xn, these could be just raw images, for instance, in ImageNet or some other image data set. They could be text documents. They could be anything. y1 through yn, in classical machine learning, think of them as plus minus 1 labels-- cat, not cat-- or in a regression setup as some real number.
So that's our training data. We have d dimensional raw vectors, n of those. And we have corresponding labels which can be either plus or minus 1 in a classification setting or a real number in a regression setting. It's kind of immaterial for my lecture right now. So that's the input.
And whenever anybody says large scale machine learning, what do we really mean? What we mean is that both n and d can be large. So what does that mean in words? That n is the number of training data points. So n could be, these days, what? A million, 10 million, 100 million, depends on how big computers and data sets you've got. So n can be huge.
d, the dimensionality, the vectors that we are working with-- the raw vectors-- that can also be pretty large. Think of x as an image. If it's a megapixel image, wow, d is like a million already. If you are somebody like Criteo or Facebook or Google, and you're serving web advertisements, d-- these are the features-- could be like several hundred million, even a billion, where they encode all sorts of nasty stuff and information they collect about you as users. So many nasty things they can collect, right?
So d and n are huge. And it's because both d and n are huge that we are interested in thinking of optimization methods for large scale machine learning that can handle such big d and n. And this is driving a lot of research in theoretical computer science, including the search for sublinear time algorithms and all sorts of data structures and hashing tricks just to deal with these two quantities.
So here is an example-- super toy example. And I hope really that I can squeeze in a little bit of proof later on towards the end. I'll take a vote here in class to see if you are interested. Let's look at the most classic question, least squares regression. A is a matrix of observations-- or sorry, measurements. b are the observations. You're trying to minimize Ax minus b, whole thing squared. Of course, a linear system of equations, the most classical problem in linear algebra, can also be written like that, let's say.
This can be expanded. Hopefully, you are comfortable with this norm. So the 2-norm of x squared, this is just defined as the sum of the squares of the components-- that's the definition of that notation. But I'll just write it only once now. I hope you are fully familiar with that.
So by expanding that, I managed to write the least squares problem in terms of what I call the finite sum form. So it's just going over all the rows of A-- the n rows, let's say. So that's the classical least squares problem. It assumes this finite sum form that we care about.
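Numerically, that expansion is just the identity that the squared norm of Ax minus b equals the sum over the n rows of A of the per-row squared residuals. A quick check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # n = 50 rows a_i, d = 3
b = rng.standard_normal(50)
x = rng.standard_normal(3)

# Whole objective: ||Ax - b||^2.
whole = np.linalg.norm(A @ x - b) ** 2

# Same objective as a finite sum over rows: sum_i (a_i^T x - b_i)^2.
finite_sum = sum((A[i] @ x - b[i]) ** 2 for i in range(50))
```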
Another random example is something called Lasso. Maybe if any of you has played with machine learning or statistics toolkits, you may have seen something called Lasso. Lasso is essentially least squares, but there's another simple term at the end. That, again, looks like a sum of f i's.
Support vector machines-- once a workhorse of-- they're still a workhorse for people who work with small to medium sized data. Deep learning stuff requires huge amounts of data. If you have a small to medium amount of data, logistic regression, support vector machines, trees, et cetera-- these will be your first go-to methods. They are still very widely used.
These problems are, again, written in terms of a loss over training data. So this again, has this awesome format, which I'll just now record here. I may not even need to repeat it. Sometimes I write it with a normalization-- you may wonder at some point, why-- as that finite sum problem. And maybe the example that you wanted to see is something like that.
So deep neural networks that are very popular these days, they are just yet another example of this finite sum problem. How are they an example of that? So you have n training data points, there's a neural network loss, like cross entropy, or squared loss, or what have you-- any kind of loss. yi's are the labels-- cat, not cat, or maybe multiclass. And then you have a transfer function called a deep neural network which takes raw images as input and generates a prediction whether this is a dog or not. That whole thing I'm just calling DNN.
So it's a function of ai's which are the training data. X are the [INAUDIBLE] matrices of the neural network, so I've just compressed the whole neural network into this notation. Once again, it's nothing but an instance of that finite sum. So that fi in there captures the entire neural network architecture. But mathematically, it's still just one particular instance of this finite sum problem.
And then people who do a lot of statistics, maximum likelihood estimation. This is log likelihood over n observations. You want to maximize log likelihood. Once again, just a finite sum.
So pretty much most of the problems that we're interested in machine learning and statistics, when I write them down as an optimization problem, they look like these finite sum problems. And that's the reason to develop specialized optimization procedures to solve such finite sum problems. And that's where SGD comes in.
OK. So that's kind of just the backdrop. Let's look at now how to go about solving these problems. So hopefully this iteration is familiar to you-- gradient descent, right? OK.
So just for notation, f of x refers to that entire summation. f sub i of x refers to a single component. So if you were to try to solve-- that is, to minimize this cost function, neural network, SVM, what have you-- using gradient descent, that's what one iteration would look like. Because it's a finite sum, gradients are linear operators. The gradient of the sum is the sum of the gradients-- that's gradient descent for you.
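That linearity is easy to check for the least squares case. In this sketch (made-up data; the formula 2 A-transpose (Ax minus b) is the standard gradient of the squared residual), the full gradient computed at once matches the sum of per-example gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4))   # n = 20 examples, d = 4 features
b = rng.standard_normal(20)
x = rng.standard_normal(4)

# Gradient of f(x) = ||Ax - b||^2, computed all at once.
full_grad = 2 * A.T @ (A @ x - b)

# Same gradient as the sum of per-example gradients grad f_i(x) = 2(a_i^T x - b_i) a_i.
sum_of_grads = sum(2 * (A[i] @ x - b[i]) * A[i] for i in range(20))
```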
And now, I'll just ask a rhetorical question: if you put yourself in the shoes of-- you're [INAUDIBLE] algorithm designers-- some things that you may want to think about-- what may you not like about this iteration, given that big n, big d story that I told you? So anybody have any reservations about, or drawbacks of, this iteration? Any comments?
AUDIENCE: It's a pretty big sum.
PROFESSOR SRA: It's a pretty big sum. Especially if n is, say, a billion or some bigger, [INAUDIBLE] number. That is definitely a big drawback. Because that is the prime drawback for large scale, that n is huge. There can be a variety of other drawbacks. Some of those you may have seen previously when people compare whether to do gradient descent or to do Newton, et cetera. But for the purpose of today, for finite sums, the big drawback is that computing the gradient at a single point-- there's a subscript xk missing there-- involves computing the gradient of that entire sum.
That sum is huge. So getting a single gradient to do a single step of gradient descent for a large data set could take you hours or days. So that's a major drawback. But then, if you identify that drawback, anybody have any ideas how to counter that drawback, at least, say, purely from an engineering perspective?
I heard something. Can you speak up?
PROFESSOR SRA: Using some kind of batch?
PROFESSOR SRA: You are well ahead of my slides. We are coming to that. And maybe somebody else has, essentially, the same idea. Anybody want to suggest how to circumvent that big n stuff in there? Anything-- suppose you are implementing this. What would you do?
AUDIENCE: One example at a time.
PROFESSOR SRA: One example at a time.
AUDIENCE: [INAUDIBLE] a random sample full set of n.
PROFESSOR SRA: Random sample of the full n. So these are all excellent ideas. And hence, you folks in the class have discovered the most important method for optimizing machine learning problems, sitting here in a few moments. Isn't that great? So the part that is missing is of course to make sense of, does this idea work? Why does it work?
So this idea is really at the heart of stochastic gradient descent. So let's see. Maybe I can show you an example, actually, that I-- I'll show you a simulation I found on somebody's nice web page about that. So exactly your idea, just put in slight mathematical notation, that what if at each iteration, we randomly pick some integer, i k out of the n training data points, and we instead just perform this update.
So instead of using the full gradient, you just compute the gradient of a single randomly chosen data point. So what have you done with that? One iteration is now n times faster. If n were a million or a billion, wow, that's super fast. But why should this work, right?
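As a sketch of that update (my own toy Python, not the lecture's code; the data is generated so that b equals A times x_true exactly, and the step size and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                     # consistent system, so the minimum is x_true

x = np.zeros(d)
step = 0.01
for k in range(5000):
    i = rng.integers(n)                  # randomly pick one data point i_k
    g = 2 * (A[i] @ x - b[i]) * A[i]     # stochastic gradient: n times cheaper
    x = x - step * g                     # the one changed line of gradient descent
```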
I could have done many other things. I could have not done any update and just output the 0 vector. That would take even less time. That's also an idea. It's a bad idea, but it's an idea in a similar league.
I could have done a variety of other things. Why would you think that just replacing that sum with just one random example may work? Let's see a little bit more about that.
So of course, it's n times faster, and the key question for us here, right now-- the scientific question-- is does this make sense? It makes great engineering sense. Does it make algorithmic or mathematical sense? So this idea of doing stuff in the stochastic manner was actually originally proposed by Robbins and Monro, somewhere, I think, around 1951. And that's the most advanced method that we are essentially using currently.
So I'll show you that this idea makes sense. But maybe let's first just look at a comparison of SGD with gradient descent in this guy's simulation. So this is that MATLAB code of gradient descent, and this is just a simulation of gradient descent. As you pick a different step size, that gamma in there, you move towards the optimum. If the step size is small, you make many small steps, and you keep making slow progress, and you reach there.
That's for a well-conditioned problem and an ill-conditioned problem. It takes you even longer. In a neural network type problem, which is nonconvex, you have to typically work with smaller step sizes. And if you take bigger ones, you can get crazy oscillations. But that's gradient descent.
In comparison, let's hope that this loads correctly. Well, there's even a picture of Robbins, who was a co-discoverer of the stochastic gradient method. There's a nice simulation, that instead of making that kind of deterministic descent-- after all, gradient descent is called "gradient descent." At every step it descends-- it decreases the cost function.
Stochastic gradient descent is actually a misnomer. At every step it doesn't do any descent. It does not decrease the cost function. So you see, at every step, those are the contours of the cost function. Sometimes it goes up, sometimes it goes down. It fluctuates around, but it kind of stochastically still seems to be making progress towards the optimum.
And stochastic gradient descent, because it's not using exact gradients, just working with these random examples, it actually is much more sensitive to step sizes. And you can see, as I increase the step size, its behavior. This is actually full simulation for [INAUDIBLE] problem.
So initially, what I want you to notice is-- let me go through this a few times-- keep looking at what patterns you may notice in how that line is fluctuating. Hopefully this is big enough for everybody to see. So this slider that I'm shifting is just the step size. So let me just remind you, in case you forgot, the iteration. We are running x k plus 1 is x k minus some eta k-- It's called alpha there-- times some randomly chosen data point.
You compute its gradient. This is SGD. That's what we are running. And we threw away tons of information. We didn't use the full gradient. We're just using this crude gradient.
So this process is very sensitive to the other parameter in the system, which is the step size. Much more sensitive than gradient descent, in fact. And let's see. As I vary the step size, see if you can notice some patterns on how it tries to go towards an optimum.
There's a zoomed in version, also, of this later, here. I'll come to that shortly. I'll repeat again, and then I'll ask you for your observations-- if you notice some patterns. I don't know if they're necessarily apparent. That's the thing with patterns. Because I know the answer, so I see the pattern. If you don't know the answer, you may or may not see the pattern. But I want to see if you actually see the pattern as I change the step size.
So maybe that was enough simulation. Anybody have any comments on what kind of pattern you may have observed? Yep.
AUDIENCE: It seems like the clustering in the middle is getting larger and more widespread.
PROFESSOR SRA: Yeah, definitely. That's a great observation. Any other comments?
There's one more interesting thing happening here, which is a very, very typical thing for SGD, and one of the reasons why people love SGD. Let me do that once again briefly. OK, this is tiny step size-- almost zero. Close to zero-- it's not exactly zero.
So you see what happens for a very tiny step size? It doesn't look that stochastic, right? But that's kind of obvious from there-- if eta k is very tiny, you'll hardly make any move. So things will look very stable. And in fact, the speed at which stochastic gradient converges, that's extremely sensitive to how you pick the step size. It's still an open research problem to come up with the best way to pick step sizes. So even though it's that simple, it doesn't mean it's trivial.
And as I vary the step size, it makes some progress, and it goes towards the solution. Are you now beginning to see that it seems to be making more stable progress in the beginning? And when it comes close to the solution, it's fluctuating more. And the bigger the step size, the wilder the fluctuation near the solution, as he noticed back there.
But one very interesting thing is more or less constant. There is more fluctuation also on the outside, but you see that the initial part still seems to be making pretty good progress. And as you come close to the solution, it fluctuates more. And that is a very typical behavior of stochastic gradient descent, that in the beginning, it makes rapid strides. So you may see your training loss decrease super fast and then kind of peter out.
And it's this particular behavior which got people super excited, that, hey, in machine learning, we are working with all sorts of big data. I just want quick and dirty progress on my training. I don't care about getting to the best optimum. Because in machine learning, you don't just care about solving the optimization problem, you actually care about finding solutions that work well on unseen data.
So that means you don't want to overfit and solve the optimization problem supremely well. So it's great to make rapid initial progress. And if after that the progress peters out, it's OK. This intuitive statement that I'm making-- in some nice cases, like convex optimization problems, one can mathematically fully quantify it. One can prove theorems to quantify each thing that I said in terms of how close, how fast, and so on. We'll see a little bit of that.
And this is what really happens to SGD. It makes great initial progress, and regardless of how you use step sizes, close to the optimum it can either get stuck, or enter some kind of chaos dynamics, or just behave like crazy. So that's typical of SGD.
And let's look at now slight mathematical insight into roughly why this behavior may happen. This is a trivial, one-dimensional optimization problem, but it conveys the crux of why this behavior is displayed by stochastic gradient methods. That it works really well in the beginning, and then, God knows what happens when it comes close to the optimum, anything can happen. So let's look at that.
OK. So let's look at a simple, one-dimensional optimization problem. I'll kind of draw it out maybe on the other side so that people on this side are not disadvantaged. So I'll just draw out a least squares problem-- x is one dimensional. Previously, I had ai transpose x. Now, ai is also a scalar. So it's just 1D stuff-- everything is 1D.
So this is our setup. Think of ai times x minus bi. These are quadratic functions, so they look like this. Corresponding to different i's, there are some different functions sitting there, and so on.
So these are my n different loss functions, and I want to minimize their sum. We know-- we can actually explicitly compute the solution of that problem. You set the derivative of f of x to 0. Hopefully, that's easy for you to do.
So if you do that differentiation, you'll get that the gradient of f of x is just given by-- well, you can do that in your head, I'll just write it out explicitly-- the sum of ai x minus bi, times ai. Set that equal to zero, and you solve that for x. You get x star, the optimum of this least squares problem. So we actually know how to solve it pretty easily.
That's a really cool example actually. I got that from textbook by Professor Dimitry [INAUDIBLE]. Now, a very interesting thing. We are not going to use the full gradient, we are only going to use the gradients of individual components. So what does the minimum of an individual component look like?
Well, the minimum of an individual component is attained when we can set this thing to 0. And that thing becomes 0 if we just pick x equal to bi divided by ai, right? So a single component can be minimized by that choice.
So you can do a little bit of arithmetic mean, geometric mean type inequalities to draw this picture. So over all i from 1 through n, this is the minimum value of this ratio, bi over ai. And let's say this is the max value of bi over ai.
And we know that closed form solution, that is the true solution. So you can verify with some algebra that that solution will lie in this interval. So you may want to-- this is a tiny exercise for you. Hopefully some of you love inequalities like me. So this is hopefully not such a bad exercise. But you can verify that within this range of the individual mins and max is where the combined solution lies. So of course, intuitively, with a physics styles thinking, you would have guessed that right away.
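That exercise is easy to check numerically: the closed-form solution x star equals the sum of ai bi over the sum of ai squared, which is a convex combination of the individual minimizers bi over ai (with weights ai squared over the total), so it must lie between their min and max. A small sketch with made-up positive ai:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=10)   # keep a_i > 0 so each b_i / a_i is well-defined
b = rng.uniform(-1.0, 1.0, size=10)

# Closed-form minimizer of f(x) = sum_i (a_i x - b_i)^2.
x_star = np.sum(a * b) / np.sum(a ** 2)

# Each individual component f_i is minimized at b_i / a_i.
ratios = b / a
```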
That means when you're outside where the individual solutions lie-- let's call this the far out zone. And also, this side is the far out zone. And this region, within which the true minimum can lie-- you can say, OK, that's the region of confusion.
Why am I calling it the region of confusion? Because there, by minimizing an individual fi, you're not going to be able to tell what is the combined x star. That's all. And a very interesting thing happens now-- just to gain some mathematical insight into that simulation that I showed you-- if you have a scalar x that is outside this region of confusion. That is, you're far from the region within which an optimum can lie. So you're far away.
So you've just started out your progress, you made a random initialization, most likely far away from where the solution is. So suppose that's where you are. What happens when you're in that far out region? So if you're in the far out region, you use a stochastic gradient of some i-th component.
So the full gradient will look like that. A stochastic gradient looks like just one component. And when you're far out, outside that min and max regime, then you can check by just looking at it, that a stochastic gradient, in the far away regime, has exactly the same sign as the full gradient.
What does gradient descent do? It says, well, walk in the direction of the negative gradient. And far away from the optimum, outside the region of confusion, your stochastic gradient has the same sign as the true gradient. Maybe in more linear algebra terms, it makes an acute angle with the true gradient. So that means, even though a stochastic gradient is not exactly the full gradient, it has some component in the direction of the true gradient.
This is in 1D. Here it is exactly the same sign. In multiple dimensions, this is the idea, that it'll have some component in the direction of the true gradient when you're far away. Which means, if you then use that direction to make an update in that style, you will end up making solid progress. And the beauty is, in the time it would have taken you to do one single iteration of batch gradient descent, far away you can do millions of stochastic steps, and each step will make some progress.
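The sign claim is easy to verify in the 1D least squares setting: outside the region of confusion every residual ai x minus bi has the same sign, so every stochastic gradient 2(ai x minus bi) ai agrees in sign with the full gradient. A sketch with made-up 1D data (positive ai, and a point to the right of every bi over ai):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=10)   # positive a_i for simplicity
b = rng.uniform(-1.0, 1.0, size=10)

x = (b / a).max() + 1.0              # a point to the right of every b_i / a_i

stoch_grads = 2 * (a * x - b) * a    # one stochastic gradient per component i
full_grad = stoch_grads.sum()        # the full gradient is their sum
```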
And that's where we see this dramatic initial progress-- again, in the 1D case this is explicit mathematically. In the high-D case, this is more intuitive. Without further assumptions about angles, et cetera, we can't make such a broad claim. But intuitively, this is what's happening, and why you see this awesome initial speed.
And once you're inside the region of confusion, then this behavior breaks down. Some stochastic gradient may have the same sign as the full gradient, some may not. And that's why you can get crazy fluctuations. So this simple 1D example kind of exactly shows you what we saw in that picture.
And people really love this initial progress. Because, often we also do early stopping. You train for some time, and then you say, OK, I'm done.
So importantly, if you are purely an optimization person, not thinking so much in terms of machine learning, then please keep in mind that stochastic gradient descent, or the stochastic gradient method, is not such a great optimization method. Because once in the region of confusion, it can just fluctuate all over forever. And in machine learning, you say, oh, the region of confusion, that's fine. It'll make my method robust. It'll make my neural network training more robust. It'll generalize better, et cetera, et cetera-- we like that.
So it depends on which frame of mind you're in. So that's the awesome thing about the stochastic gradient method.
So I'll give you now the key mathematical ideas behind the success of SGD. This was like a little illustration. Very abstractly, this is an idea that [INAUDIBLE] throughout machine learning and throughout theoretical computer science and statistics: anytime you're faced with the need to compute an expensive quantity, resort to randomization to speed up the computation. SGD is one example. The true gradient was expensive to compute, so we create a randomized estimate of the true gradient. And the randomized estimate is much faster to compute.
And mathematically, what will start happening is, depending on how good your randomized estimate is, your method may or may not converge to the right answer. So of course, one has to be careful about what particular randomized estimate one makes. But really abstractly, even if I hadn't shown you the main idea, this idea you can apply in many other settings. If you have a difficult quantity, come up with a randomized estimate and save on computation. This is a very important theme throughout machine learning and data science.
And this is the key property. So stochastic gradient descent, it uses stochastic gradients. Stochastic is, here, used very loosely. And it just means that there's some randomization. That's all it means.
And the property-- the key property that we have is in expectation. The expectation is over whatever randomness you used. So if you picked some random training data point out of the million, then the expectation is over the probability distribution over what kind of randomness you used. If you picked uniformly at random from a million points, then this expectation is over that uniform probability.
But the key property for SGD, or at least the version of SGD I'm talking about, is that, over that randomness, the thing that you're pretending to use instead of the true gradient, in expectation, actually is the true gradient. So in statistics language, the stochastic gradient that we use is an unbiased estimate of the true gradient. And this is a very important property in the mathematical analysis of stochastic gradient descent, that it is an unbiased estimate.
Intuitively speaking, anytime you did any proof in class, or in the book, or lecture, or wherever, where you were using true gradients, more or less, you can do those same proofs-- more or less, not always-- using stochastic gradients, by encapsulating everything within expectations over the randomness. I'll show you an example of what I mean by that. I'm just trying to simplify that for you.
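Unbiasedness under uniform sampling is just the statement that the average of the per-example gradients equals the full gradient of the normalized finite sum. A quick check for the least squares case (illustrative data of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)

# Gradient of f(x) = (1/n) * sum_i (a_i^T x - b_i)^2.
full_grad = 2 * A.T @ (A @ x - b) / n

# Row i is the stochastic gradient grad f_i(x); uniform sampling picks each row
# with probability 1/n, so the expectation is just the row average.
per_example = 2 * (A @ x - b)[:, None] * A
expected_stoch_grad = per_example.mean(axis=0)
```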
In particular, the unbiasedness is great. So it means I can kind of plug in these stochastic gradients in place of the true gradient, and I'm still doing something meaningful. So this is answering that earlier question: why this random stuff? Why should we think it may work?
But there's another very important aspect to why it works, beyond this unbiasedness, that the amount of noise, or the amount of stochasticity is controlled. So just because it is an unbiased estimate, doesn't mean that it's going to work that well.
Why? Because it could still fluctuate hugely, right? Essentially, plus infinity here, minus infinity here. You take an average, you get 0. So that is essentially unbiased, but the fluctuation is gigantic. So whenever talking about estimates, what's the other key quantity we need to care about beyond expectation?
PROFESSOR SRA: Variance. And really, the key thing that governs the speed at which stochastic gradient descent does the job that we want it to do is, how much variance do the stochastic gradients have?
Just this simple statistical point, in fact, is at the heart of a sequence of research progress in the past five years in the field of stochastic gradient, where people have worked really hard to come up with newer and newer, fancier and fancier versions of stochastic gradient which have the unbiasedness property, but have smaller and smaller variance. And the smaller the variance you have, the better your stochastic gradient is as a replacement of the true gradient. And of course, the better a replacement of the true gradient, the more you truly get that n-times speedup.
So the speed of convergence depends on how noisy the stochastic gradients are. It seems like I'm going too slow. I won't be able to do a proof, which sucks. But let me actually tell you then about-- rather than the proof-- I think I'll share the proof with Gil. Because the proof that I wanted to show you gives a proof that stochastic gradient is well-behaved on both convex and nonconvex problems. And the proof I wanted to show was for the nonconvex case, because it applies to neural networks. So you may be curious about that proof. And remarkably, that proof is much simpler than in the case of convex problems.
So let me just mention some very important points about stochastic gradient. So even though this method has been around since 1951, every deep learning tool kit has it, and we are studying it in class, there are still gaps between what we can say theoretically and what happens in practice. And I'll show you those gaps already, and encourage you to think about those if you wish.
So let's look back at our problem and describe two variants. So here are the two variants. I'm going to ask if any of you is familiar with these variants in some way or the other. So I just said, start with any feasible point. Here, there are no constraints, so start with any random vector of your choice. In deep network training you have to work harder.
And then, this is the iteration you run-- option 1 and option 2. So option 1 says, that was the idea we had in class, randomly pick some training data point, use its stochastic gradient. What do we mean by randomly pick? The moment you use the word random, you have to define what's the randomness.
So one randomness is uniform probability over n training data points. That is one randomness. The other version is you pick a training data point without replacement. So with replacement means uniformly at random: each time, you draw a number from 1 through n, use its stochastic gradient, move on. Which means the same point can easily be picked twice, also.
And without replacement means, if you've picked a point number three, you're not going to pick it again until you've gone through the entire training data set. Those are two types of randomness. Which version would you use?
There is no right or wrong answer to this. I'm just taking a poll. What would you use? Think that you're writing a program for this, and maybe think really pragmatically, practically. So that's enough of a hint. Which one would you use-- I'm just curious.
Who would use 1? Please, raise hands. OK. And the exclusion-- the complement thereof. I don't know. Maybe some people are undecided. Who would use 2? Very few people. Ooh, OK.
How many of you use neural network training toolkits like TensorFlow, PyTorch, whatnot? Which version are they using?
Actually, every person in the real world is using version 2. Are you really going to randomly go through your RAM each time to pick random points? That'll kill your GPU performance like anything.
What people do is take a data set, use a pre-shuffle operation, and then just stream through the data. What does streaming through the data mean? Without replacement. So all the toolkits actually are using the without replacement version, even though, intuitively, uniform random feels much nicer.
And that feeling is not ill-founded, because that's the only version we know how to analyze mathematically. So even for this method, everybody studies it. There are a million papers on it. The version that is used in practice is not the version we know how to analyze. It's a major open problem in the field of stochastic gradient to actually analyze the version that we use in practice.
It's kind of embarrassing, but without replacement means non-IID probability theory, and non-IID probability theory is not so easy. That's the answer. OK.
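The two sampling schemes just discussed can be sketched in a few lines. This is a minimal illustration of the idea, not the actual code any toolkit uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of training points

# Option 1: with replacement -- IID uniform draws from {0, ..., n-1};
# the same point can be picked twice within one pass.  This is the
# version the theory knows how to analyze.
with_replacement = [int(rng.integers(n)) for _ in range(n)]

# Option 2: without replacement -- pre-shuffle once, then stream
# through the data in that order, as the toolkits actually do.
without_replacement = list(rng.permutation(n))
```

After one pass, `without_replacement` visits every index exactly once, while `with_replacement` typically has duplicates and misses.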
So the other version is this mini-batch idea-- which you mentioned really early on-- that rather than pick one random point, I'll pick a mini-batch. So I had a million points-- each time, instead of picking one, maybe I'll pick 10, or 100, or 1,000, or what have you. So this averages things. Averaging things reduces the variance. So this is actually a good thing, because the more quantities you average, the less noise you have. That's kind of what happens in probability.
So we pick a mini-batch, and the stochastic estimate now is not just a single gradient, but an average over a mini-batch. So a mini-batch of size 1 is pure vanilla SGD. A mini-batch of size n is nothing other than pure gradient descent. Something in between is what people actually use. And again, the theoretical analysis only exists if the mini-batch is picked with replacement, not without replacement.
So one of the reasons actually-- a very important thing-- is that in theory, you don't gain too much in terms of convergence speed by using mini-batches. But mini-batches are really crucial, especially in deep learning, GPU-style training, because they allow you to do things in parallel. Each thread or each core or sub-core or small chip or what have you, depending on your hardware, can be working with one stochastic gradient. So with mini-batches, the larger the mini-batch, the more things you can do in parallel.
So mini-batches are greatly exploited by people to give you a cheap version of parallelism. And where does the parallelism happen? You can think that each core computes a stochastic gradient. So the hard part is not adding these things up and making the update to x, the hard part is computing a stochastic gradient. So if you can compute 10,000 of those in parallel because you have 10,000 cores, great for you. And that's the reason people love using mini-batches.
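To make the mini-batch update concrete, here is a toy sketch on a least-squares problem. The loss, learning rate, and batch size are my own illustrative choices, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true  # consistent system, so the optimum fits all points

def minibatch_sgd_step(x, batch_size, lr):
    # Pick a mini-batch WITH replacement (the analyzable version).
    idx = rng.integers(n, size=batch_size)
    # Average the per-sample gradients of 0.5 * (a_i^T x - b_i)^2.
    residual = A[idx] @ x - b[idx]
    grad = A[idx].T @ residual / batch_size
    return x - lr * grad

x = np.zeros(d)
for _ in range(2000):
    x = minibatch_sgd_step(x, batch_size=32, lr=0.05)
```

Batch size 1 here is vanilla SGD, and `batch_size=n` with the full index set recovers ordinary gradient descent, matching the spectrum described above.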
But a nice side remark here-- this also brings us closer to the research edge of things again. Well, you'd love to use very large mini-batches so that you can fully max out on the parallelism available to you. Maybe you have a multi-GPU system, if you're friends with NVIDIA or Google. I only have two GPUs.
But it depends on how many GPUs you have. You'd like to really max out on parallelism so that you can really crunch through big data sets as fast as possible. But you know what happens with very large mini-batches? If you have very large mini-batches, stochastic gradient starts looking more like-- full gradient descent, which is also called batch gradient descent.
That's not a bad thing. That's awesome for optimization. But a weird conundrum happens in training deep neural networks. This type of problem we wouldn't have for convex optimization. But in deep neural networks, this really disturbing thing happens: if you use these very large mini-batches, your method starts resembling gradient descent. That means it decreases the noise so much that this region of confusion shrinks dramatically-- which all sounds good, but it ends up being really bad for machine learning.
That's what I said, that in machine learning you want some region of uncertainty. And what it means, actually-- a lot of people have been working on this, including at big companies-- is that if you reduce that region of uncertainty too much, you end up over-fitting your neural network. And then its performance on test data-- unseen data-- starts suffering. So even though for parallelism, programming, and optimization theory a big mini-batch is awesome, unfortunately there's a price to be paid: it hurts your test error performance.
And there are all sorts of methods people are trying to cook up, including scaling the learning rate accordingly, or changing the neural network architecture, and all sorts of ideas. You can cook up your own ideas for your favorite architecture-- how to use a large mini-batch without hurting the final performance. But it's still somewhat of an open question how to optimally select how large your mini-batch should be. So even though these ideas are simple, you see that every simple idea leads to an entire sub-area of SGD.
So here are the practical challenges. People have various heuristics for addressing these challenges. You can cook up your own, but no one idea always works. So if you look at SGD, what are the moving parts? The moving parts in SGD are the stochastic gradients, the step size, and the mini-batch. So how should I pick step sizes? A very non-trivial problem.
Different deep learning toolkits may have different ways of automating that tuning, but it's one of the painful things. Which mini-batch to use? With replacement or without replacement-- I already showed you. But which mini-batch should I use, and how large should it be? Again, not an easy question to answer.
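One classical heuristic for the step-size question-- a sketch of the textbook O(1/t) decay, with illustrative constants of my own choosing, not any toolkit's default-- looks like this:

```python
def step_size(t, alpha0=0.1, tau=100.0):
    """Classical O(1/t) decay: alpha_t = alpha0 / (1 + t / tau).

    Big steps early give the fast initial descent; shrinking steps
    later keep the iterates from bouncing around forever in the
    region of confusion.
    """
    return alpha0 / (1.0 + t / tau)
```

Practical systems often use staircase or cosine schedules instead, but this decaying form is the one the classical convergence theory is built around.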
How to compute stochastic gradients? Does anybody know how stochastic gradients are computed for deep network training? Anybody know? There is a very famous algorithm called backpropagation. That backpropagation algorithm is used to compute a single stochastic gradient.
Some people use the word backprop to mean SGD. But what backprop really means is an algorithm which computes for you a single stochastic gradient. And hence TensorFlow, et cetera-- these toolkits-- come up with all sorts of ways to automate the computation of a gradient. Because, really, that's the main thing.
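As a toy illustration of that point-- a hand-rolled sketch, not the automatic differentiation machinery TensorFlow or PyTorch actually use-- here is backprop computing the stochastic gradient for one sample through a one-hidden-layer ReLU network with squared loss:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 4, 3
W1 = rng.standard_normal((h, d))   # hidden-layer weights
w2 = rng.standard_normal(h)        # output-layer weights

def single_sample_gradient(x, y):
    # Forward pass: z = W1 x, a = relu(z), yhat = w2 . a
    z = W1 @ x
    a = np.maximum(z, 0.0)
    yhat = w2 @ a
    # Backward pass (chain rule) for the loss 0.5 * (yhat - y)^2.
    dyhat = yhat - y               # dL/dyhat
    dw2 = dyhat * a                # dL/dw2
    da = dyhat * w2                # dL/da
    dz = da * (z > 0)              # ReLU derivative gates the signal
    dW1 = np.outer(dz, x)          # dL/dW1
    return dW1, dw2

x = rng.standard_normal(d)
y = 1.0
dW1, dw2 = single_sample_gradient(x, y)
```

The SGD update then subtracts a step size times `dW1` and `dw2`; a mini-batch version would simply average these per-sample gradients.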
And then there are other ideas like gradient clipping, momentum, et cetera-- a bunch of other ideas. And the theoretical challenges I mentioned to you already-- proving that it works, that it actually solves what it set out to do. Unfortunately, I was too slow. I couldn't show you the awesome five-line proof that I have that SGD works for neural networks. And the theoretical analysis, as I said, really lags behind practice.
My proof also uses the with-replacement version. And on the without-replacement version, which is the one that is actually implemented, there's very little progress. There is some progress-- there's a bunch of papers, including from our colleagues at MIT-- but it's quite unsolved.
And the biggest question, which most of the people in machine learning are currently excited about these days, is stuff like: why does SGD work so well for neural networks? We use this crappy optimization method, it very rapidly does some fitting-- the data is large, the neural network is large-- and then this neural network ends up having great classification performance. Why is that happening? This is what's called trying to build a theory of generalization. Why does an SGD-trained neural network work better than neural networks trained with fancier optimization methods?
It's a mystery, and for most of the people who take an interest in theoretical machine learning and statistics, that is one of the mysteries they're trying to understand. So I think that's my story of SGD. And this is the part we skipped, but it's OK. The intuition behind SGD is much more important than this.
So I think we can close.
PROFESSOR STRANG: Thank you.
And maybe I can learn the proof for Monday's lecture.
PROFESSOR SRA: Exactly. Yeah, I think so. That'll be great.