Lecture 25: Stochastic Gradient Descent

Course Info

Instructor: Prof. Gilbert Strang
Departments: Mathematics
As Taught In: Spring 2018
Level: Undergraduate

Description

Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems.

Summary

Full gradient descent uses all data in each step.
The stochastic method uses a minibatch of data (often just 1 sample!).
Each step is much faster, and the descent starts well.
Later the iterates bounce around the minimum: time to stop!
This method is the favorite for weights in deep learning.
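
The contrast in the summary above can be illustrated with a minimal sketch. It is not taken from the lecture: the one-parameter least-squares problem, the synthetic noisy data, the step sizes, and all function names (`make_data`, `full_gradient`, `sample_gradient`, `gradient_descent`, `sgd`) are assumptions chosen for illustration. Full gradient descent averages the gradient over every data point at each step, while the stochastic version uses a single randomly chosen sample.

```python
import random


def make_data(n=200, slope=3.0, noise=0.5, seed=0):
    """Hypothetical synthetic data: b_i = slope * a_i + Gaussian noise."""
    rng = random.Random(seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    b = [slope * ai + rng.gauss(0.0, noise) for ai in a]
    return a, b


def full_gradient(x, a, b):
    """Gradient of (1/n) * sum_i (a_i * x - b_i)^2, using all the data."""
    n = len(a)
    return sum(2.0 * ai * (ai * x - bi) for ai, bi in zip(a, b)) / n


def sample_gradient(x, ai, bi):
    """Gradient of the single term (a_i * x - b_i)^2."""
    return 2.0 * ai * (ai * x - bi)


def least_squares_solution(a, b):
    """Exact minimizer of the one-parameter problem, for comparison."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)


def gradient_descent(a, b, step=0.5, iters=200, x0=0.0):
    """Full gradient descent: every step touches all n samples."""
    x = x0
    for _ in range(iters):
        x -= step * full_gradient(x, a, b)
    return x


def sgd(a, b, step=0.1, iters=2000, x0=0.0, seed=1):
    """Stochastic gradient descent with a minibatch of one random sample."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(iters):
        i = rng.randrange(len(a))  # pick one sample at random
        x -= step * sample_gradient(x, a[i], b[i])
        path.append(x)
    return path


if __name__ == "__main__":
    a, b = make_data()
    x_star = least_squares_solution(a, b)
    path = sgd(a, b)
    print("least-squares minimizer:", x_star)
    print("full gradient descent:  ", gradient_descent(a, b))
    print("SGD after 100 steps:    ", path[100])
    print("SGD last five iterates: ", path[-5:])
```

With a constant step size, the late SGD iterates in this sketch keep fluctuating near the minimizer rather than settling on it, which is the "bounce around" behavior noted in the summary. Each SGD step costs one sample instead of all `n`.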

Related section in textbook: VI.5

Instructor: Prof. Suvrit Sra

Problem for Lecture 25
From textbook Section VI.5

1. Suppose we want to minimize \(F(x,y)=y^2+(y-x)^2\). The actual minimum is \(F=0\) at \((x^\ast, y^\ast)=(0,0)\). Find the gradient vector \(\boldsymbol{\nabla F}\) at the starting point \((x_0, y_0)=(1,1)\). For full gradient descent (not stochastic) with step \(s=\frac{1}{2}\), where is \((x_1, y_1)\)?
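
One way to check a hand computation for this problem is the short sketch below. It is an illustration, not the textbook's solution: the function names `F`, `grad_F`, and `descent_step` are assumptions, and the partial derivatives \(\partial F/\partial x = 2(x-y)\) and \(\partial F/\partial y = 2y + 2(y-x)\) follow from differentiating \(F(x,y)=y^2+(y-x)^2\) directly.

```python
def F(x, y):
    """Objective from the problem: F(x, y) = y^2 + (y - x)^2."""
    return y ** 2 + (y - x) ** 2


def grad_F(x, y):
    """Gradient of F: (dF/dx, dF/dy)."""
    dFdx = 2.0 * (x - y)
    dFdy = 2.0 * y + 2.0 * (y - x)
    return dFdx, dFdy


def descent_step(x, y, s):
    """One full gradient descent step with step size s."""
    gx, gy = grad_F(x, y)
    return x - s * gx, y - s * gy


if __name__ == "__main__":
    x0, y0 = 1.0, 1.0
    s = 0.5
    print("gradient at (x0, y0):", grad_F(x0, y0))
    x1, y1 = descent_step(x0, y0, s)
    print("(x1, y1):", (x1, y1))
    # Comparing F before and after the step shows whether s = 1/2
    # actually reduced the objective from this starting point.
    print("F(x0, y0) =", F(x0, y0), " F(x1, y1) =", F(x1, y1))
```

Calling `descent_step` repeatedly on its own output is a quick way to see how the iterates behave with this step size.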