Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems.
Full gradient descent uses all data in each step.
Stochastic method uses a minibatch of data (often 1 sample!).
Each step is much faster and the descent starts well.
Later the points bounce around / time to stop!
This method is the favorite for weights in deep learning.
Related section in textbook: VI.5
Instructor: Prof. Suvrit Sra
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR STRANG: [INAUDIBLE] Professor Suvrit Sra from EECS who taught 6.036 and the graduate version. And maybe some of you had him in one or other of those classes. So he graciously agreed to come today and to talk about Stochastic Gradient Descent, SGD. And it's terrific. Yeah, yeah.
So we're not quite at 1:05, but close. If everything is ready, then we're off. OK. Good.
PROFESSOR SRA: And your cutoff is like 1:55?
PROFESSOR STRANG: Yeah.
PROFESSOR SRA: OK.
PROFESSOR STRANG: But this is not a sharp cutoff.
PROFESSOR SRA: Why is there [INAUDIBLE] fluctuation?
PROFESSOR STRANG: There you go.
PROFESSOR SRA: Somebody changed their resolution it seems, but that's fine. It doesn't bother us. So I'm going to tell you about, let's say, one of the most ancient optimization methods, much simpler than, in fact, the more advanced methods you have already seen in class. And interestingly, this more ancient method remains "the" method for training large scale machine learning systems.
So there's a little bit of history around that. I'm not going to go too much into the history. But the bottom line, which probably Gil has also mentioned to you in class, is that, at least for large data science problems, in the end, stuff reduces to solving an optimization problem. And in current times these optimization problems are pretty large. So people actually started liking stuff like gradient descent, which was invented by Cauchy back in the day.
And this is how I'm writing the abstract problem. And what I want to see is-- OK, is it fitting on the page? This is my implementation in MATLAB of gradient descent, just to set the stage that this stuff really looks simple. You've already seen gradient descent. And today, essentially, in a nutshell, what really changes in this implementation of gradient descent is this part. That's it.
So you've seen gradient descent. I'm only going to change this one line. And the change of that one line, surprisingly, is driving all the deep learning tool boxes and all of large scale machine learning, et cetera. This is an oversimplification, but morally, that's it. So let's look at what's happening.
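To set the stage concretely, here is a minimal Python sketch of that gradient descent loop (the lecture's version is in MATLAB; the function names and toy objective here are my own illustration), with the one line that SGD will later change marked:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - step * grad f(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad_f(x)   # <-- SGD will replace only this gradient
    return x

# Toy objective f(x) = ||x - c||^2, with gradient 2(x - c); minimum at c.
c = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: 2 * (x - c), np.zeros(2))
```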
So I will become very concrete pretty soon. But abstractly, what I want you to look at is the kinds of optimization problems we are solving in machine learning. And I'll give you very concrete examples of these optimization problems so that you can relate to them better. But I'm just writing this as the key setup, that all the optimization problems that I'm going to talk about today, they look like that. You're trying to find an x that minimizes a cost function, where the cost function can be written as a sum.
In modern day machine learning parlance these are also called finite sum problems, in case you run into that term. And they just call it finite because n is finite here. In pure optimization theory parlance, n can actually go to infinity. And then they're called stochastic optimization problems-- just for terminology, if while searching the internet you run into some such terminology so you kind of know what it means.
So here is our setup in machine learning. We have a bunch of training data. On this slide, I'm calling x1 through xn. These are the training data, the raw features. Later, actually, I'll stop writing x for them and write them with the letter a. But hopefully, that's OK.
So x1 through xn, these could be just raw images, for instance, in ImageNet or some other image data set. They could be text documents. They could be anything. y1 through yn, in classical machine learning, think of them as plus minus 1 labels-- cat, not cat-- or in a regression setup as some real number.
So that's our training data. We have d dimensional raw vectors, n of those. And we have corresponding labels which can be either plus or minus 1 in a classification setting or a real number in a regression setting. It's kind of immaterial for my lecture right now. So that's the input.
And whenever anybody says large scale machine learning, what do we really mean? What we mean is that both n and d can be large. So what does that mean in words? That n is the number of training data points. So n could be, these days, what? A million, 10 million, 100 million, depends on how big computers and data sets you've got. So n can be huge.
d, the dimensionality, the vectors that we are working with-- the raw vectors-- that can also be pretty large. Think of x as an image. If it's a megapixel image, wow, d is like a million already. If you are somebody like Criteo or Facebook or Google, and you're serving web advertisements, d-- these are the features-- could be like several hundred million, even a billion, where they encode all sorts of nasty stuff and information they collect about you as users. So many nasty things they can collect, right?
So d and n are huge. And it's because both d and n are huge that we are interested in thinking of optimization methods for large scale machine learning that can handle such big d and n. And this is driving a lot of research in theoretical computer science, including the search for sublinear time algorithms and all sorts of data structures and hashing tricks just to deal with these two quantities.
So here is an example-- super toy example. And I hope really that I can squeeze in a little bit of proof later on towards the end. I'll take a vote here in class to see if you are interested. Let's look at the most classic question, least squares regression. A is a matrix of observations-- or sorry, measurements. b are the observations. You're trying to minimize Ax minus b, whole thing squared. Of course, a linear system of equations, the most classical problem in linear algebra, can also be written like that, let's say.
This can be expanded. Hopefully, you are comfortable with this norm. So the 2-norm of x squared, this is just defined as the sum of the squares of the components-- that's the definition of that notation. But I'll just write it only once now. I hope you are fully familiar with that.
So by expanding that, I managed to write the least squares problem in terms of what I call the finite sum form. So it's just going over all the rows of A-- the n rows, let's say. So that's the classical least squares problem. It assumes this finite sum form that we care about.
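Numerically, that expansion is just the identity that the squared norm of Ax minus b equals the sum over the n rows of A of the per-row squared residuals. A quick check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # n = 50 rows a_i, d = 3
b = rng.standard_normal(50)
x = rng.standard_normal(3)

# Whole objective: ||Ax - b||^2.
whole = np.linalg.norm(A @ x - b) ** 2

# Same objective as a finite sum over rows: sum_i (a_i^T x - b_i)^2.
finite_sum = sum((A[i] @ x - b[i]) ** 2 for i in range(50))
```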
Another random example is something called Lasso. Maybe if any of you has played with machine learning or statistics toolkits, you may have seen something called Lasso. Lasso is essentially least squares, but there's another simple term at the end. That, again, looks like a sum of f i's.
Support vector machines-- once a workhorse of-- they're still a workhorse for people who work with small to medium sized data. Deep learning stuff requires huge amounts of data. If you have a small to medium amount of data, logistic regression, support vector machines, trees, et cetera-- these will be your first go-to methods. They are still very widely used.
These problems are, again, written in terms of a loss over training data. So this again, has this awesome format, which I'll just now record here. I may not even need to repeat it. Sometimes I write it with a normalization-- you may wonder at some point, why-- as that finite sum problem. And maybe the example that you wanted to see is something like that.
So deep neural networks that are very popular these days, they are just yet another example of this finite sum problem. How are they an example of that? So you have n training data points, there's a neural network loss, like cross entropy, or squared loss, or what have you-- any kind of loss. yi's are the labels-- cat, not cat, or maybe multiclass. And then you have a transfer function called a deep neural network which takes raw images as input and generates a prediction whether this is a dog or not. That whole thing I'm just calling DNN.
So it's a function of ai's which are the training data. X are the [INAUDIBLE] matrices of the neural network, so I've just compressed the whole neural network into this notation. Once again, it's nothing but an instance of that finite sum. So that fi in there captures the entire neural network architecture. But mathematically, it's still just one particular instance of this finite sum problem.
And then people who do a lot of statistics, maximum likelihood estimation. This is log likelihood over n observations. You want to maximize log likelihood. Once again, just a finite sum.
So pretty much most of the problems that we're interested in machine learning and statistics, when I write them down as an optimization problem, they look like these finite sum problems. And that's the reason to develop specialized optimization procedures to solve such finite sum problems. And that's where SGD comes in.
OK. So that's kind of just the backdrop. Let's look at now how to go about solving these problems. So hopefully this iteration is familiar to you-- gradient descent, right? OK.
So just for notation, f of x refers to that entire summation. f sub i of x refers to a single component. So if you were to try to solve-- that is, to minimize this cost function, neural network, SVM, what have you-- using gradient descent, that's what one iteration would look like. Because it's a finite sum, gradients are linear operators. The gradient of the sum is the sum of the gradients-- that's gradient descent for you.
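That linearity is easy to check for the least squares case. In this sketch (made-up data; the formula 2 A-transpose (Ax minus b) is the standard gradient of the squared residual), the full gradient computed at once matches the sum of per-example gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4))   # n = 20 examples, d = 4 features
b = rng.standard_normal(20)
x = rng.standard_normal(4)

# Gradient of f(x) = ||Ax - b||^2, computed all at once.
full_grad = 2 * A.T @ (A @ x - b)

# Same gradient as the sum of per-example gradients grad f_i(x) = 2(a_i^T x - b_i) a_i.
sum_of_grads = sum(2 * (A[i] @ x - b[i]) * A[i] for i in range(20))
```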
And now, I'll just ask a rhetorical question: if you put yourself in the shoes of-- you're [INAUDIBLE] algorithm designers-- some things that you may want to think about-- what may you not like about this iteration, given that big n, big d story that I told you? So anybody have any reservations about, or drawbacks of, this iteration? Any comments?
AUDIENCE: It's a pretty big sum.
PROFESSOR SRA: It's a pretty big sum. Especially if n is, say, a billion or some bigger, [INAUDIBLE] number. That is definitely a big drawback. Because that is the prime drawback for large scale, that n is huge. There can be a variety of other drawbacks. Some of those you may have seen previously when people compare whether to do gradient descent or to do Newton, et cetera. But for the purpose of today, for finite sums, the big drawback is that computing the gradient at a single point-- there's a subscript xk missing there-- involves computing the gradient of that entire sum.
That sum is huge. So getting a single gradient to do a single step of gradient descent for a large data set could take you hours or days. So that's a major drawback. But then, if you identify that drawback, anybody have any ideas how to counter that drawback, at least, say, purely from an engineering perspective?
I heard something. Can you speak up?
PROFESSOR SRA: Using some kind of batch?
PROFESSOR SRA: You are well ahead of my slides. We are coming to that. And maybe somebody else has, essentially, the same idea. Anybody want to suggest how to circumvent that big n stuff in there? Anything-- suppose you are implementing this. What would you do?
AUDIENCE: One example at a time.
PROFESSOR SRA: One example at a time.
AUDIENCE: [INAUDIBLE] a random sample full set of n.
PROFESSOR SRA: Random sample of the full n. So these are all excellent ideas. And hence, you folks in the class have discovered the most important method for optimizing machine learning problems, sitting here in a few moments. Isn't that great? So the part that is missing is of course to make sense of, does this idea work? Why does it work?
So this idea is really at the heart of stochastic gradient descent. So let's see. Maybe I can show you an example, actually, that I-- I'll show you a simulation I found on somebody's nice web page about that. So exactly your idea, just put in slight mathematical notation, that what if at each iteration, we randomly pick some integer, i k out of the n training data points, and we instead just perform this update.
So instead of using the full gradient, you just compute the gradient of a single randomly chosen data point. So what have you done with that? One iteration is now n times faster. If n were a million or a billion, wow, that's super fast. But why should this work, right?
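As a sketch of that update (my own toy Python, not the lecture's code; the data is generated so that b equals A times x_true exactly, and the step size and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                     # consistent system, so the minimum is x_true

x = np.zeros(d)
step = 0.01
for k in range(5000):
    i = rng.integers(n)                  # randomly pick one data point i_k
    g = 2 * (A[i] @ x - b[i]) * A[i]     # stochastic gradient: n times cheaper
    x = x - step * g                     # the one changed line of gradient descent
```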
I could have done many other things. I could have not done any update and just output the 0 vector. That would take even less time. That's also an idea. It's a bad idea, but it's an idea in a similar league.
I could have done a variety of other things. Why would you think that just replacing that sum with just one random example may work? Let's see a little bit more about that.
So of course, it's n times faster, and the key question for us here, right now-- the scientific question-- is does this make sense? It makes great engineering sense. Does it make algorithmic or mathematical sense? So this idea of doing stuff in the stochastic manner was actually originally proposed by Robbins and Monro, somewhere, I think, around 1951. And that's the most advanced method that we are essentially using currently.
So I'll show you that this idea makes sense. But maybe let's first just look at a comparison of SGD with gradient descent in this guy's simulation. So this is that MATLAB code of gradient descent, and this is just a simulation of gradient descent. As you pick a different step size, that gamma in there, you move towards the optimum. If the step size is small, you make many small steps, and you keep making slow progress, and you reach there.
That's for a well-conditioned problem and an ill-conditioned problem. It takes you even longer. In a neural network type problem, which is nonconvex, you have to typically work with smaller step sizes. And if you take bigger ones, you can get crazy oscillations. But that's gradient descent.
In comparison, let's hope that this loads correctly. Well, there's even a picture of Robbins, who was a co-discoverer of the stochastic gradient method. There's a nice simulation, that instead of making that kind of deterministic descent-- after all, gradient descent is called "gradient descent." At every step it descends-- it decreases the cost function.
Stochastic gradient descent is actually a misnomer. At every step it doesn't do any descent. It does not decrease the cost function. So you see, at every step, those are the contours of the cost function. Sometimes it goes up, sometimes it goes down. It fluctuates around, but it kind of stochastically still seems to be making progress towards the optimum.
And stochastic gradient descent, because it's not using exact gradients, just working with these random examples, it actually is much more sensitive to step sizes. And you can see, as I increase the step size, its behavior. This is actually full simulation for [INAUDIBLE] problem.
So initially, what I want you to notice is-- let me go through this a few times-- keep looking at what patterns you may notice in how that line is fluctuating. Hopefully this is big enough for everybody to see. So this slider that I'm shifting is just the step size. So let me just remind you, in case you forgot, the iteration. We are running x k plus 1 is x k minus some eta k-- It's called alpha there-- times some randomly chosen data point.
You compute its gradient. This is SGD. That's what we are running. And we threw away tons of information. We didn't use the full gradient. We're just using this crude gradient.
So this process is very sensitive to the other parameter in the system, which is the step size. Much more sensitive than gradient descent, in fact. And let's see. As I vary the step size, see if you can notice some patterns on how it tries to go towards an optimum.
There's a zoomed in version, also, of this later, here. I'll come to that shortly. I'll repeat again, and then I'll ask you for your observations-- if you notice some patterns. I don't know if they're necessarily apparent. That's the thing with patterns. Because I know the answer, so I see the pattern. If you don't know the answer, you may or may not see the pattern. But I want to see if you actually see the pattern as I change the step size.
So maybe that was enough simulation. Anybody have any comments on what kind of pattern you may have observed? Yep.
AUDIENCE: It seems like the clustering in the middle is getting larger and more widespread.
PROFESSOR SRA: Yeah, definitely. That's a great observation. Any other comments?
There's one more interesting thing happening here, which is a very, very typical thing for SGD, and one of the reasons why people love SGD. Let me do that once again briefly. OK, this is tiny step size-- almost zero. Close to zero-- it's not exactly zero.
So you see what happens for a very tiny step size? It doesn't look that stochastic, right? But that's kind of obvious from there-- if eta k is very tiny, you'll hardly make any move. So things will look very stable. And in fact, the speed at which stochastic gradient converges, that's extremely sensitive to how you pick the step size. It's still an open research problem to come up with the best way to pick step sizes. So even though it's that simple, it doesn't mean it's trivial.
And as I vary the step size, it makes some progress, and it goes towards the solution. Are you now beginning to see that it seems to be making more stable progress in the beginning? And when it comes close to the solution, it's fluctuating more. And the bigger the step size, the wilder the fluctuation near the solution, as he noticed back there.
But one very interesting thing is more or less constant. There is more fluctuation also on the outside, but you see that the initial part still seems to be making pretty good progress. And as you come close to the solution, it fluctuates more. And that is a very typical behavior of stochastic gradient descent, that in the beginning, it makes rapid strides. So you may see your training loss decrease super fast and then kind of peter out.
And it's this particular behavior which got people super excited, that, hey, in machine learning, we are working with all sorts of big data. I just want quick and dirty progress on my training. I don't care about getting to the best optimum. Because in machine learning, you don't just care about solving the optimization problem, you actually care about finding solutions that work well on unseen data.
So that means you don't want to overfit and solve the optimization problem supremely well. So it's great to make rapid initial progress. And if after that the progress peters out, it's OK. This intuitive statement that I'm making-- in some nice cases, like convex optimization problems, one can mathematically fully quantify it. One can prove theorems to quantify each thing that I said in terms of how close, how fast, and so on. We'll see a little bit of that.
And this is what really happens to SGD. It makes great initial progress, and regardless of how you use step sizes, close to the optimum it can either get stuck, or enter some kind of chaos dynamics, or just behave like crazy. So that's typical of SGD.
And let's look at now slight mathematical insight into roughly why this behavior may happen. This is a trivial, one-dimensional optimization problem, but it conveys the crux of why this behavior is displayed by stochastic gradient methods. That it works really well in the beginning, and then, God knows what happens when it comes close to the optimum, anything can happen. So let's look at that.
OK. So let's look at a simple, one-dimensional optimization problem. I'll kind of draw it out maybe on the other side so that people on this side are not disadvantaged. So I'll just draw out a least squares problem-- x is one dimensional. Previously, I had ai transpose x. Now, ai is also a scalar. So it's just 1D stuff-- everything is 1D.
So this is our setup. Think of ai times x minus bi. These are quadratic functions, so they look like this. Corresponding to different i's, there are some different functions sitting there, and so on.
So these are my n different loss functions, and I want to minimize their sum. We know-- we can actually explicitly compute the solution of that problem. You set the derivative of f of x to 0. Hopefully, that's easy for you to do.
So if you do that differentiation, you'll get that the gradient of f of x is just given by-- well, you can do that in your head, I'll just write it out explicitly-- the sum of ai x minus bi, times ai. Set that equal to zero, and you solve that for x. You get x star, the optimum of this least squares problem. So we actually know how to solve it pretty easily.
That's a really cool example actually. I got that from textbook by Professor Dimitry [INAUDIBLE]. Now, a very interesting thing. We are not going to use the full gradient, we are only going to use the gradients of individual components. So what does the minimum of an individual component look like?
Well, the minimum of an individual component is attained when we can set this thing to 0. And that thing becomes 0 if we just pick x equal to bi divided by ai, right? So a single component can be minimized by that choice.
So you can do a little bit of arithmetic mean, geometric mean type inequalities to draw this picture. So over all i from 1 through n, this is the minimum value of this ratio, bi over ai. And let's say this is the max value of bi over ai.
And we know that closed form solution, that is the true solution. So you can verify with some algebra that that solution will lie in this interval. So you may want to-- this is a tiny exercise for you. Hopefully some of you love inequalities like me. So this is hopefully not such a bad exercise. But you can verify that within this range of the individual mins and max is where the combined solution lies. So of course, intuitively, with a physics styles thinking, you would have guessed that right away.
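That exercise is easy to check numerically: the closed-form solution x star equals the sum of ai bi over the sum of ai squared, which is a convex combination of the individual minimizers bi over ai (with weights ai squared over the total), so it must lie between their min and max. A small sketch with made-up positive ai:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=10)   # keep a_i > 0 so each b_i / a_i is well-defined
b = rng.uniform(-1.0, 1.0, size=10)

# Closed-form minimizer of f(x) = sum_i (a_i x - b_i)^2.
x_star = np.sum(a * b) / np.sum(a ** 2)

# Each individual component f_i is minimized at b_i / a_i.
ratios = b / a
```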
That means when you're outside where the individual solutions lie-- let's call this the far out zone. And also, this side is the far out zone. And this region, within which the true minimum can lie-- you can say, OK, that's the region of confusion.
Why am I calling it the region of confusion? Because there, by minimizing an individual fi, you're not going to be able to tell what is the combined x star. That's all. And a very interesting thing happens now-- just to gain some mathematical insight into that simulation that I showed you-- if you have a scalar x that is outside this region of confusion. That is, you're far from the region within which an optimum can lie. So you're far away.
So you've just started out your progress, you made a random initialization, most likely far away from where the solution is. So suppose that's where you are. What happens when you're in that far out region? So if you're in the far out region, you use a stochastic gradient of some i-th component.
So the full gradient will look like that. A stochastic gradient looks like just one component. And when you're far out, outside that min and max regime, then you can check by just looking at it, that a stochastic gradient, in the far away regime, has exactly the same sign as the full gradient.
What does gradient descent do? It says, well, walk in the direction of the negative gradient. And far away from the optimum, outside the region of confusion, your stochastic gradient has the same sign as the true gradient. Maybe in more linear algebra terms, it makes an acute angle with the true gradient. So that means, even though a stochastic gradient is not exactly the full gradient, it has some component in the direction of the true gradient.
This is in 1D. Here it is exactly the same sign. In multiple dimensions, this is the idea, that it'll have some component in the direction of the true gradient when you're far away. Which means, if you then use that direction to make an update in that style, you will end up making solid progress. And the beauty is, in the time it would have taken you to do one single iteration of batch gradient descent, far away you can do millions of stochastic steps, and each step will make some progress.
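The sign claim is easy to verify in the 1D least squares setting: outside the region of confusion every residual ai x minus bi has the same sign, so every stochastic gradient 2(ai x minus bi) ai agrees in sign with the full gradient. A sketch with made-up 1D data (positive ai, and a point to the right of every bi over ai):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=10)   # positive a_i for simplicity
b = rng.uniform(-1.0, 1.0, size=10)

x = (b / a).max() + 1.0              # a point to the right of every b_i / a_i

stoch_grads = 2 * (a * x - b) * a    # one stochastic gradient per component i
full_grad = stoch_grads.sum()        # the full gradient is their sum
```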
And that's where we see this dramatic initial progress-- again, in the 1D case this is explicit mathematically. In the high-D case, this is more intuitive. Without further assumptions about angles, et cetera, we can't make such a broad claim. But intuitively, this is what's happening, and why you see this awesome initial speed.
And once you're inside the region of confusion, then this behavior breaks down. Some stochastic gradient may have the same sign as the full gradient, some may not. And that's why you can get crazy fluctuations. So this simple 1D example kind of exactly shows you what we saw in that picture.
And people really love this initial progress. Because, often we also do early stopping. You train for some time, and then you say, OK, I'm done.
So importantly, if you are purely an optimization person, not thinking so much in terms of machine learning, then please keep in mind that stochastic gradient descent, or the stochastic gradient method, is not such a great optimization method. Because once in the region of confusion, it can just fluctuate all over forever. And in machine learning, you say, oh, the region of confusion, that's fine. It'll make my method robust. It'll make my neural network training more robust. It'll generalize better, et cetera, et cetera-- we like that.
So it depends on which frame of mind you're in. So that's the awesome thing about the stochastic gradient method.
So I'll give you now the key mathematical ideas behind the success of SGD. This was like a little illustration. Very abstractly, this is an idea that [INAUDIBLE] throughout machine learning and throughout theoretical computer science and statistics: anytime you're faced with the need to compute an expensive quantity, resort to randomization to speed up the computation. SGD is one example. The true gradient was expensive to compute, so we create a randomized estimate of the true gradient. And the randomized estimate is much faster to compute.
And mathematically, what will start happening is, depending on how good your randomized estimate is, your method may or may not converge to the right answer. So of course, one has to be careful about what particular randomized estimate one makes. But really abstractly, even if I hadn't shown you the main idea, this idea you can apply in many other settings. If you have a difficult quantity, come up with a randomized estimate and save on computation. This is a very important theme throughout machine learning and data science.
And this is the key property. So stochastic gradient descent, it uses stochastic gradients. Stochastic is, here, used very loosely. And it just means that there's some randomization. That's all it means.
And the property-- the key property that we have is in expectation. The expectation is over whatever randomness you used. So if you picked some random training data point out of the million, then the expectation is over the probability distribution over what kind of randomness you used. If you picked uniformly at random from a million points, then this expectation is over that uniform probability.
But the key property for SGD, or at least the version of SGD I'm talking about, is that, over that randomness, the thing that you're pretending to use instead of the true gradient, in expectation, actually is the true gradient. So in statistics language, the stochastic gradient that we use is an unbiased estimate of the true gradient. And this is a very important property in the mathematical analysis of stochastic gradient descent, that it is an unbiased estimate.
Intuitively speaking, anytime you did any proof in class, or in the book, or lecture, or wherever, where you were using true gradients, more or less, you can do those same proofs-- more or less, not always-- using stochastic gradients, by encapsulating everything within expectations over the randomness. I'll show you an example of what I mean by that. I'm just trying to simplify that for you.
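Unbiasedness under uniform sampling is just the statement that the average of the per-example gradients equals the full gradient of the normalized finite sum. A quick check for the least squares case (illustrative data of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)

# Gradient of f(x) = (1/n) * sum_i (a_i^T x - b_i)^2.
full_grad = 2 * A.T @ (A @ x - b) / n

# Row i is the stochastic gradient grad f_i(x); uniform sampling picks each row
# with probability 1/n, so the expectation is just the row average.
per_example = 2 * (A @ x - b)[:, None] * A
expected_stoch_grad = per_example.mean(axis=0)
```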
In particular, the unbiasedness is great. So it means I can kind of plug in these stochastic gradients in place of the true gradient, and I'm still doing something meaningful. So this is answering that earlier question: why this random stuff? Why should we think it may work?
But there's another very important aspect to why it works, beyond this unbiasedness, that the amount of noise, or the amount of stochasticity is controlled. So just because it is an unbiased estimate, doesn't mean that it's going to work that well.
Why? Because it could still fluctuate hugely, right? Essentially, plus infinity here, minus infinity here. You take an average, you get 0. So that is essentially unbiased, but the fluctuation is gigantic. So whenever talking about estimates, what's the other key quantity we need to care about beyond expectation?
PROFESSOR SRA: Variance. And really, the key thing that governs the speed at which stochastic gradient descent does the job that we want it to do is, how much variance do the stochastic gradients have?
Just this simple statistical point, in fact, is at the heart of a sequence of research progress in the past five years in the field of stochastic gradient, where people have worked really hard to come up with newer and newer, fancier and fancier versions of stochastic gradient which have the unbiasedness property, but have smaller and smaller variance. And the smaller the variance you have, the better your stochastic gradient is as a replacement of the true gradient. And of course, the better a replacement of the true gradient, the more you truly get that n-times speedup.
So the speed of convergence depends on how noisy the stochastic gradients are. It seems like I'm going too slow. I won't be able to do a proof, which sucks. But let me actually tell you then about-- rather than the proof-- I think I'll share the proof with Gil. Because the proof that I wanted to show you gives a proof that stochastic gradient is well-behaved on both convex and nonconvex problems. And the proof I wanted to show was for the nonconvex case, because it applies to neural networks. So you may be curious about that proof. And remarkably, that proof is much simpler than in the case of convex problems.
So let me just mention some very important points about stochastic gradient. So even though this method has been around since 1951, every deep learning tool kit has it, and we are studying it in class, there are still gaps between what we can say theoretically and what happens in practice. And I'll show you those gaps already, and encourage you to think about those if you wish.
So let's look back at our problem and describe two variants. So here are the two variants. I'm going to ask if any of you is familiar with these variants in some way or the other. So I just said, start with any feasible point. Here, there are no constraints, so start with any random vector of your choice. In deep network training you have to work harder.
And then, this is the iteration you run-- option 1 and option 2. So option 1 says, that was the idea we had in class, randomly pick some training data point, use its stochastic gradient. What do we mean by randomly pick? The moment you use the word random, you have to define what's the randomness.
So one randomness is uniform probability over n training data points. That is one randomness. The other version is you pick a training data point without replacement. So with replacement means uniformly at random: each time, you draw a number from 1 through n, use its stochastic gradient, move on. Which means the same point can easily be picked twice, also.
And without replacement means, if you've picked a point number three, you're not going to pick it again until you've gone through the entire training data set. Those are two types of randomness. Which version would you use?
There is no right or wrong answer to this. I'm just taking a poll. What would you use? Think that you're writing a program for this, and maybe think really pragmatically, practically. So that's enough of a hint. Which one would you use-- I'm just curious.
Who would use 1? Please, raise hands. OK. And the exclusion-- the complement thereof. I don't know. Maybe some people are undecided. Who would use 2? Very few people. Ooh, OK.
How many of you use neural network training toolkits like TensorFlow, PyTorch, whatnot? Which version are they using?
Actually, every person in the real world is using version 2. Are you really going to randomly go through your RAM each time to pick random points? That'll kill your GPU performance like anything.
What people do is take a data set, use a pre-shuffle operation, and then just stream through the data. What does streaming through the data mean? Without replacement. So all the toolkits actually are using the without replacement version, even though, intuitively, uniform random feels much nicer.
And that feeling is not ill-founded, because that's the only version we know how to analyze mathematically. So even for this method, everybody studies it. There are a million papers on it. The version that is used in practice is not the version we know how to analyze. It's a major open problem in the field of stochastic gradient to actually analyze the version that we use in practice.
It's kind of embarrassing, but without replacement means non-IID probability theory, and non-IID probability theory is not so easy. That's the answer. OK.
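The two sampling schemes just discussed can be sketched in a few lines. This is a minimal illustration of the idea, not the actual code any toolkit uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of training points

# Option 1: with replacement -- IID uniform draws from {0, ..., n-1};
# the same point can be picked twice within one pass.  This is the
# version the theory knows how to analyze.
with_replacement = [int(rng.integers(n)) for _ in range(n)]

# Option 2: without replacement -- pre-shuffle once, then stream
# through the data in that order, as the toolkits actually do.
without_replacement = list(rng.permutation(n))
```

After one pass, `without_replacement` visits every index exactly once, while `with_replacement` typically has duplicates and misses.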
So the other version is this mini-batch idea-- which you mentioned really early on-- that rather than pick one random point, I'll pick a mini-batch. So I had a million points-- each time, instead of picking one, maybe I'll pick 10, or 100, or 1,000, or what have you. So this averages things. Averaging things reduces the variance. So this is actually a good thing, because the more quantities you average, the less noise you have. That's kind of what happens in probability.
So we pick a mini-batch, and the stochastic estimate now is not just a single gradient, but an average over a mini-batch. So a mini-batch of size 1 is pure vanilla SGD. A mini-batch of size n is nothing other than pure gradient descent. Something in between is what people actually use. And again, the theoretical analysis only exists if the mini-batch is picked with replacement, not without replacement.
So one of the reasons actually-- a very important thing-- is that in theory, you don't gain too much in terms of convergence speed by using mini-batches. But mini-batches are really crucial, especially in deep learning, GPU-style training, because they allow you to do things in parallel. Each thread or each core or sub-core or small chip or what have you, depending on your hardware, can be working with one stochastic gradient. So with mini-batches, the larger the mini-batch, the more things you can do in parallel.
So mini-batches are greatly exploited by people to give you a cheap version of parallelism. And where does the parallelism happen? You can think that each core computes a stochastic gradient. So the hard part is not adding these things up and making the update to x, the hard part is computing a stochastic gradient. So if you can compute 10,000 of those in parallel because you have 10,000 cores, great for you. And that's the reason people love using mini-batches.
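To make the mini-batch update concrete, here is a toy sketch on a least-squares problem. The loss, learning rate, and batch size are my own illustrative choices, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true  # consistent system, so the optimum fits all points

def minibatch_sgd_step(x, batch_size, lr):
    # Pick a mini-batch WITH replacement (the analyzable version).
    idx = rng.integers(n, size=batch_size)
    # Average the per-sample gradients of 0.5 * (a_i^T x - b_i)^2.
    residual = A[idx] @ x - b[idx]
    grad = A[idx].T @ residual / batch_size
    return x - lr * grad

x = np.zeros(d)
for _ in range(2000):
    x = minibatch_sgd_step(x, batch_size=32, lr=0.05)
```

Batch size 1 here is vanilla SGD, and `batch_size=n` with the full index set recovers ordinary gradient descent, matching the spectrum described above.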
But a nice side remark here-- this also brings us closer to the research edge of things again. Well, you'd love to use very large mini-batches so that you can fully max out on the parallelism available to you. Maybe you have a multi-GPU system, if you're friends with NVIDIA or Google. I only have two GPUs.
But it depends on how many GPUs you have. You'd like to really max out on parallelism so that you can really crunch through big data sets as fast as possible. But you know what happens with very large mini-batches? If you have very large mini-batches, stochastic gradient starts looking more like-- full gradient descent, which is also called batch gradient descent.
That's not a bad thing. That's awesome for optimization. But a weird conundrum happens in training deep neural networks. This type of problem we wouldn't have for convex optimization. But in deep neural networks, this really disturbing thing happens: if you use these very large mini-batches, your method starts resembling gradient descent. That means it decreases the noise so much that this region of confusion shrinks dramatically-- which all sounds good, but it ends up being really bad for machine learning.
That's what I said, that in machine learning you want some region of uncertainty. And what it means, actually-- a lot of people have been working on this, including at big companies-- is that if you reduce that region of uncertainty too much, you end up over-fitting your neural network. And then its performance on test data-- unseen data-- starts suffering. So even though for parallelism, programming, and optimization theory a big mini-batch is awesome, unfortunately there's a price to be paid: it hurts your test error performance.
And there are all sorts of methods people are trying to cook up, including scaling the learning rate accordingly, or changing the neural network architecture, and all sorts of ideas. You can cook up your own ideas for your favorite architecture-- how to use a large mini-batch without hurting the final performance. But it's still somewhat of an open question how to optimally select how large your mini-batch should be. So even though these ideas are simple, you see that every simple idea leads to an entire sub-area of SGD.
So here are the practical challenges. People have various heuristics for addressing these challenges. You can cook up your own, but no one idea always works. So if you look at SGD, what are the moving parts? The moving parts in SGD are the stochastic gradients, the step size, and the mini-batch. So how should I pick step sizes? A very non-trivial problem.
Different deep learning toolkits may have different ways of automating that tuning, but it's one of the painful things. Which mini-batch to use? With replacement or without replacement-- I already showed you. But which mini-batch should I use, and how large should it be? Again, not an easy question to answer.
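One classical heuristic for the step-size question-- a sketch of the textbook O(1/t) decay, with illustrative constants of my own choosing, not any toolkit's default-- looks like this:

```python
def step_size(t, alpha0=0.1, tau=100.0):
    """Classical O(1/t) decay: alpha_t = alpha0 / (1 + t / tau).

    Big steps early give the fast initial descent; shrinking steps
    later keep the iterates from bouncing around forever in the
    region of confusion.
    """
    return alpha0 / (1.0 + t / tau)
```

Practical systems often use staircase or cosine schedules instead, but this decaying form is the one the classical convergence theory is built around.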
How to compute stochastic gradients? Does anybody know how stochastic gradients are computed for deep network training? Anybody know? There is a very famous algorithm called backpropagation. That backpropagation algorithm is used to compute a single stochastic gradient.
Some people use the word backprop to mean SGD. But what backprop really means is an algorithm which computes for you a single stochastic gradient. And hence TensorFlow, et cetera-- these toolkits-- come up with all sorts of ways to automate the computation of a gradient. Because, really, that's the main thing.
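As a toy illustration of that point-- a hand-rolled sketch, not the automatic differentiation machinery TensorFlow or PyTorch actually use-- here is backprop computing the stochastic gradient for one sample through a one-hidden-layer ReLU network with squared loss:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 4, 3
W1 = rng.standard_normal((h, d))   # hidden-layer weights
w2 = rng.standard_normal(h)        # output-layer weights

def single_sample_gradient(x, y):
    # Forward pass: z = W1 x, a = relu(z), yhat = w2 . a
    z = W1 @ x
    a = np.maximum(z, 0.0)
    yhat = w2 @ a
    # Backward pass (chain rule) for the loss 0.5 * (yhat - y)^2.
    dyhat = yhat - y               # dL/dyhat
    dw2 = dyhat * a                # dL/dw2
    da = dyhat * w2                # dL/da
    dz = da * (z > 0)              # ReLU derivative gates the signal
    dW1 = np.outer(dz, x)          # dL/dW1
    return dW1, dw2

x = rng.standard_normal(d)
y = 1.0
dW1, dw2 = single_sample_gradient(x, y)
```

The SGD update then subtracts a step size times `dW1` and `dw2`; a mini-batch version would simply average these per-sample gradients.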
And then there are other ideas like gradient clipping, momentum, et cetera-- a bunch of other ideas. And the theoretical challenges I mentioned to you already-- proving that it works, that it actually solves what it set out to do. Unfortunately, I was too slow. I couldn't show you the awesome five-line proof that I have that SGD works for neural networks. And the theoretical analysis, as I said, really lags behind practice.
My proof also uses the with-replacement version. And on the without-replacement version, which is the one that is actually implemented, there's very little progress. There is some progress-- there's a bunch of papers, including from our colleagues at MIT-- but it's quite unsolved.
And the biggest question, which most of the people in machine learning are currently excited about these days, is stuff like: why does SGD work so well for neural networks? We use this crappy optimization method, it very rapidly does some fitting-- the data is large, the neural network is large-- and then this neural network ends up having great classification performance. Why is that happening? This is what's called trying to build a theory of generalization. Why does an SGD-trained neural network work better than neural networks trained with fancier optimization methods?
It's a mystery, and for most of the people who take an interest in theoretical machine learning and statistics, that is one of the mysteries they're trying to understand. So I think that's my story of SGD. And this is the part we skipped, but it's OK. The intuition behind SGD is much more important than this.
So I think we can close.
PROFESSOR STRANG: Thank you.
And maybe I can learn the proof for Monday's lecture.
PROFESSOR SRA: Exactly. Yeah, I think so. That'll be great.