Lecture 1
Lecture Notes
- Part 1: Introduction (PDF)
- Part 2: Derivatives as Linear Operators [notes not available]
Further Readings:
- matrixcalculus.org is a fun site to play with derivatives of matrix and vector functions.
- The Matrix Cookbook (PDF) has a lot of formulas for these derivatives, but no derivations.
- Notes on Vector and Matrix Differentiation (PDF) are helpful.
- Fancier math: The perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I’m calling “fancy”) presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The “little-o notation” o(δx) we’re using here for “infinitesimal asymptotics” is closely related to the Big O notation used in computer science, but in computer science people are typically taking the limit as the argument (often called “n”) becomes very large instead of very small. A fancy name for a row vector is a “covector” or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
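To make the little-o statement concrete, here is a minimal Julia sketch (not part of the course materials; the matrix A, the point x, and the perturbation direction are made up): for f(x) = xᵀAx the derivative is the linear operator δx ↦ xᵀ(A + Aᵀ)δx, and the leftover term f(x+δx) − f(x) − f′(x)[δx] shrinks faster than ‖δx‖.

```julia
using LinearAlgebra

# f(x) = xᵀAx, whose derivative is the linear operator δx ↦ xᵀ(A + Aᵀ)δx.
A = [2.0 1.0; 0.0 3.0]
f(x) = dot(x, A * x)
fprime(x, δx) = dot(x, (A + A') * δx)   # the derivative acting on a perturbation

x  = [1.0, 2.0]
δx = [0.3, -0.4]
for s in (1e-1, 1e-2, 1e-3, 1e-4)
    remainder = f(x + s*δx) - f(x) - fprime(x, s*δx)
    println("‖δx‖ = ", norm(s*δx), "   remainder = ", remainder)
end
# The remainder shrinks like ‖δx‖², i.e. faster than ‖δx‖, which is what o(δx) means.
```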
Lecture 2
Lecture Notes
- Part 1: Derivatives as Linear Operators (cont.) [notes not available]
- Part 2: Two by Two Matrix Jacobians (html) (download the zip file) (Pluto notebook source code) (download the source code zip file)
Further Readings:
- The terms “forward-mode” and “reverse-mode” differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course.
- You can find many, many articles online about backpropagation in neural networks.
- There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback.
- This video on “Understanding Automatic Differentiation” by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD.
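As a minimal illustration of the right-to-left (forward) versus left-to-right (reverse) ordering, here is a Julia sketch with made-up Jacobian sizes; both orders give the same answer, but for a scalar output the left-to-right order only ever forms row vectors, which is why reverse mode is preferred for gradients of scalar loss functions.

```julia
# Three matrices standing in for the Jacobians of a composition c(b(a(x))):
Ja = randn(10, 4)   # ∂a/∂x
Jb = randn(7, 10)   # ∂b/∂a
Jc = randn(1, 7)    # ∂c/∂b  (scalar output ⇒ one-row Jacobian)
dx = randn(4)       # an input perturbation

# Forward mode: Jacobian-vector products, right to left.
fwd = Jc * (Jb * (Ja * dx))

# Reverse mode: vector-Jacobian products, left to right (gradient row first).
rev = ((Jc * Jb) * Ja) * dx

println(fwd, "  vs  ", rev, "   match: ", fwd ≈ rev)
```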
Lecture 3
Lecture Notes
- Part 1: The Gradient of a Scalar Function of a Vector: Column Vector or Row Vector? (PDF)
- Part 2: Finite Difference (Jupyter notebook) (download “Finite Difference” zip file)
Further Readings:
- Wikipedia has a useful list of properties of the matrix trace.
- The “matrix dot product” introduced today is also called the Frobenius inner product, and the corresponding norm (“length” of the matrix viewed as a vector) is the Frobenius norm.
- When you “flatten” a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products.
- The Matrix Cookbook (PDF) has lots of formulas for derivatives of matrix functions.
- There is a lot of information online on finite differences; see e.g. the 18.303 Notes on Finite Differences (PDF) or Section 5.7 of Numerical Derivatives (PDF).
- The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
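As a quick illustration of that last point, this made-up sketch forward-differences sin at x = 1: the error first shrinks as δx decreases, then grows again once floating-point cancellation dominates.

```julia
# Forward-difference approximation of d/dx sin(x) at x = 1, sweeping δx.
x = 1.0
exact = cos(x)
for δx in 10.0 .^ (-1:-2:-15)
    approx = (sin(x + δx) - sin(x)) / δx
    println("δx = ", δx, "   error = ", abs(approx - exact))
end
# Accuracy improves until roughly δx ≈ √eps, after which cancellation takes over.
```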
Lecture 4
Lecture Notes
- Part 1: The Gradient of the Determinant (html) (download the zip file) (Julia source) (download Julia source zip file)
- Part 2: Nonlinear Root-Finding, Optimization, and Adjoint-Method Differentiation (PDF)
Further Readings (Part 1):
- There are lots of discussions of the derivative of a determinant online, involving the “adjugate” matrix det(A)A⁻¹. Not as well documented is that the gradient of the determinant is the cofactor matrix widely used for the Laplace expansion of a determinant.
- The formula for the derivative of log(det X) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory).
- The Matrix Cookbook (PDF) contains many of these formulas, but no derivations.
- A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton’s method to det(A-λI)=0, and more generally one can solve det(M(λ))=0 for any function M(λ); the resulting roots λ are called nonlinear eigenvalues (if M is nonlinear in λ), and one can apply Newton’s method, using the determinant-derivative formula here, as described in “The Nonlinear Eigenvalue Problem: Part II (PDF - 1.2 MB)”. A small Julia sketch of this Newton iteration appears below.
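Here is a minimal sketch of that Newton iteration (the matrix, starting guess, and function name are made up for illustration): using d(det M) = det(M) tr(M⁻¹ dM) with M(λ) = A − λI, the Newton update for det(A − λI) = 0 simplifies to λ ← λ + 1/tr((A − λI)⁻¹).

```julia
using LinearAlgebra

# Newton's method on g(λ) = det(A - λI) = 0.  Since dM/dλ = -I, the
# determinant-derivative formula gives g'(λ) = -det(A - λI) * tr((A - λI)⁻¹),
# so the Newton step g/g' collapses to 1 / tr((A - λI)⁻¹).
function newton_det_eig(A, λ; iters = 8)
    for i in 1:iters
        λ += 1 / tr(inv(A - λ*I))
        println("iteration $i: λ = $λ")
    end
    return λ
end

A = [2.0 1.0; 1.0 3.0]                 # made-up symmetric test matrix
newton_det_eig(A, 0.0)                 # converges to one of the eigenvalues
println("eigvals(A) = ", eigvals(A))   # compare with the converged root
```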
Further Readings (Part 2):
- There are many textbooks on nonlinear optimization, e.g. Nonlinear Programming (by Bertsekas), including specialized books on Convex Optimization (by Boyd and Vandenberghe), Introduction to Derivative-Free Optimization (by Conn, Scheinberg, and Vicente), etcetera.
- A useful review of topology-optimization methods can be found in Sigmund and Maute’s “Topology Optimization Approaches.”
- See the Notes on Adjoint Methods (PDF) and The Adjoint Method for Differentiating Complex Computations (PDF) from 18.335 Introduction to Numerical Methods.
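As a toy illustration of the adjoint idea in the Part 2 notes, the sketch below (everything here is made up, including the parameterization A(p) = A0 + pB, the right-hand side b, and the objective g(x) = sum(x)) differentiates the solution of A(p)x = b with a single extra “adjoint” solve and checks the result against a finite difference.

```julia
using LinearAlgebra

# Solve A(p)x = b, then one adjoint solve Aᵀλ = ∂g/∂x gives
# dg/dp = -λᵀ (dA/dp) x, no matter how many parameters p there are.
A0 = [4.0 1.0; 2.0 5.0]
B  = [1.0 0.0; 0.0 -1.0]   # dA/dp for the made-up parameterization A(p) = A0 + p*B
b  = [1.0, 2.0]
p  = 0.3

A = A0 + p*B
x = A \ b
λ = A' \ ones(2)                 # adjoint solve with ∂g/∂x = [1, 1] for g(x) = sum(x)
dgdp_adjoint = -dot(λ, B * x)

ε = 1e-6                         # finite-difference check
dgdp_fd = (sum((A0 + (p + ε)*B) \ b) - sum(x)) / ε
println(dgdp_adjoint, "  vs  ", dgdp_fd)
```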
Lecture 5
Lecture Notes
- Forward and Reverse Automatic Differentiation in a Nutshell (guest lecture by Dr. Chris Rackauckas) (download the zip file)
Further Readings:
- Googling “automatic differentiation” will turn up many, many resources; this is a huge practical field these days. ForwardDiff.jl (described in detail by “Forward-Mode Automatic Differentiation in Julia”) uses dual-number arithmetic, similar to what we described in lecture, to compute derivatives; a minimal dual-number sketch appears after this list.
- See also the article “How to Differentiate with a Computer,” or google “dual number automatic differentiation” for many other reviews.
- Implementing reverse-mode AD is much more complicated than defining a new number type, unfortunately, and involves a lot more intricacies of compiler technology. See also Chris’s blog post on Engineering Trade-Offs in Automatic Differentiation, and Chris Rackauckas’s discussion post on AD limitations.
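As promised above, here is a minimal dual-number forward-mode sketch in Julia. It is a toy version of the idea, not ForwardDiff.jl’s implementation (which supports many more operations, higher dimensions, and nested derivatives).

```julia
# A number type that carries a value together with its derivative.
struct Dual
    val::Float64
    der::Float64
end

# Propagate derivatives through +, *, and sin by the usual calculus rules.
Base.:+(a::Dual, b::Dual)   = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual)   = Dual(a.val * b.val, a.der*b.val + a.val*b.der)
Base.:*(a::Number, b::Dual) = Dual(a * b.val, a * b.der)
Base.sin(a::Dual)           = Dual(sin(a.val), cos(a.val) * a.der)

f(x) = x * sin(x) + 3x
x = Dual(2.0, 1.0)                    # seed derivative dx/dx = 1
println(f(x))                         # the .der field is f'(2)
println(sin(2.0) + 2cos(2.0) + 3)     # analytic check: f'(x) = sin(x) + x cos(x) + 3
```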
Lecture 6
Lecture Notes
- Part 1: Derivatives of Eigenproblems (html) (download the zip file) (Julia source) (download Julia source zip file)
- Part 2: Second Derivatives, Bilinear Forms, and Hessian Matrices [notes not available]
Further Readings (Part 1):
- Computing derivatives on curved surfaces (“manifolds”) is closely related to tangent spaces in differential geometry. The effect of constraints can also be expressed in terms of Lagrange multipliers, which are useful in expressing optimization problems with constraints (see also Chapter 5 of Convex Optimization by Boyd and Vandenberghe).
- In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of “time-independent” perturbation theory in quantum mechanics, or as the Hellmann–Feynman theorem for the case of dλ; a small numerical check of the dλ formula appears after this list.
- The derivative of an eigenvector involves all of the other eigenvectors, but a much simpler “vector-Jacobian product” (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a scalar function of an eigenvector, as reviewed in the 18.335 Notes on Adjoint Methods (PDF).
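Here is the small numerical check of dλ = xᵀ(dA)x mentioned above, for a simple eigenvalue of a symmetric matrix; the matrices A and dA below are made up for the demo.

```julia
using LinearAlgebra

A  = Symmetric([2.0 1.0 0.0; 1.0 3.0 0.5; 0.0 0.5 4.0])
dA = Symmetric([0.0 1.0 0.0; 1.0 0.0 2.0; 0.0 2.0 -1.0])   # perturbation direction
ε  = 1e-6

λ, X = eigen(A)                 # eigenvalues sorted ascending, orthonormal eigenvectors
x = X[:, 1]                     # unit eigenvector of the smallest eigenvalue
predicted = dot(x, dA * x)      # first-order change of λ₁ per unit of dA

actual = (eigen(A + ε*dA).values[1] - λ[1]) / ε   # finite-difference check
println(predicted, "  vs  ", actual)
```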
Further Readings (Part 2):
- Bilinear forms are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a symmetric bilinear form. This is closely related to a quadratic form, which is just what we get by plugging in the same vector twice, e.g. the f’’(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form.
- The most familiar multivariate version of f’’(x) is the Hessian matrix.
- Khan Academy has an introduction to quadratic approximation.
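As a small illustration of the quadratic approximation, the sketch below (with a made-up two-variable function and a hand-computed gradient and Hessian) checks that f(x+δx) minus the quadratic model f(x) + ∇f(x)ᵀδx + ½ δxᵀH(x)δx shrinks like ‖δx‖³, one order faster than the linear approximation.

```julia
using LinearAlgebra

# f(x) = x₁³ + x₁x₂², with gradient and Hessian computed by hand.
f(x)  = x[1]^3 + x[1]*x[2]^2
∇f(x) = [3x[1]^2 + x[2]^2, 2x[1]*x[2]]
H(x)  = [6x[1]  2x[2];
         2x[2]  2x[1]]

x = [1.0, 2.0]
for s in (1e-1, 1e-2, 1e-3)
    δx = s * [0.7, -0.3]
    quadratic = f(x) + dot(∇f(x), δx) + dot(δx, H(x)*δx)/2
    println("‖δx‖ = ", norm(δx), "   error = ", f(x + δx) - quadratic)
end
```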
Lecture 7
Lecture Notes
- Part 1: Hessian Matrices (cont.) [notes not available]
- Part 2: Backpropagation through Back Substitution with a Backslash (PDF)
Further Readings (Part 1):
- Positive-definite Hessian matrices, or more generally definite quadratic forms f″, appear at extrema (f′=0) of scalar-valued functions f(x) that are local minima; there are a lot more formal treatments of the same idea, e.g. these notes on Unconstrained Optimization (PDF), and conversely Khan Academy has the simple two-variable version where you can check the signs of the eigenvalues of the 2×2 Hessian just by looking at the determinant and a single diagonal entry (or the trace).
- There’s a nice Stack Exchange discussion on why an ill-conditioned Hessian tends to make steepest descent converge slowly; a tiny numerical illustration of this slowdown appears after this list.
- University of Toronto’s Course Notes on Optimization (PDF - 3.3 MB) may also be helpful.
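Here is the tiny illustration of that slowdown: exact-line-search steepest descent on a quadratic with a condition-number-100 Hessian (all names and numbers below are made up) shrinks f by only about the classic ((κ−1)/(κ+1))² factor per step.

```julia
using LinearAlgebra

# Steepest descent on f(x) = ½xᵀHx with an ill-conditioned (diagonal) Hessian,
# using the exact line-search step for quadratics, α = gᵀg / gᵀHg.
function steepest_descent(H, x, iters)
    for i in 1:iters
        g = H * x                        # ∇f(x) = Hx
        α = dot(g, g) / dot(g, H * g)    # exact line search along -g
        x -= α * g
        i % 10 == 0 && println("iteration $i: f(x) = ", dot(x, H*x)/2)
    end
    return x
end

steepest_descent(Diagonal([1.0, 100.0]), [1.0, 1.0], 50)   # condition number κ = 100
```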
Further Readings (Part 2):
- See this blog post on Calculus on Computational Graphs for a gentle review.
- See Columbia University’s Course Notes on Computational Graphs, and Backpropagation (PDF) for a more formal approach.
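As a hand-sized companion to those readings, here is a reverse sweep over the tiny made-up computational graph z = x·y + sin(x): each intermediate gets an adjoint (the derivative of the output with respect to that node), propagated from the output back to the inputs.

```julia
# Graph:  a = x*y,  b = sin(x),  z = a + b
x, y = 2.0, 3.0

# forward pass
a = x * y
b = sin(x)
z = a + b

# reverse pass: "bar" variables are ∂z/∂(node), pushed backwards from z
zbar = 1.0                        # ∂z/∂z
abar = zbar                       # z = a + b ⇒ ∂z/∂a = 1
bbar = zbar                       # ∂z/∂b = 1
xbar = abar * y + bbar * cos(x)   # x feeds both a = x*y and b = sin(x)
ybar = abar * x

println((xbar, ybar), "  vs  ", (y + cos(x), x))   # analytic gradient of z
```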
Lecture 8
Lecture Notes
- Part 1: Hessian Matrices (cont.) [notes not available]
- Part 2: Differentiable Programming and Neural Differential Equations (guest lecture by Dr. Chris Rackauckas)
Further Readings (Part 1):
- See e.g. these Stanford University notes on Sequential Convex Programming (PDF), which use trust regions (sec. 2.2).
- See 18.335 notes on Quasi-Newton Optimization: Origin of the BFGS Update (PDF).
- The fact that a quadratic optimization problem in a sphere has strong duality, and hence is efficiently solvable, is discussed in section 5.2.4 of Convex Optimization by Boyd and Vandenberghe.
- There has been a lot of work on automatic computation of Hessians, but for large-scale problems you can ultimately only compute Hessian-vector products efficiently in general; these are equivalent to a directional derivative of the gradient, and can be used e.g. for Newton-Krylov methods. A small sketch of this directional-derivative trick appears after this list.
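The sketch below illustrates that directional-derivative trick with a made-up two-variable function: a Hessian-vector product is approximated by finite-differencing a hand-coded gradient along v, without ever forming the Hessian. (In practice one would differentiate the gradient with forward-mode AD instead of a finite difference.)

```julia
# Hessian-vector product as a directional derivative of the gradient:
#   H(x)*v ≈ (∇f(x + ε*v) - ∇f(x)) / ε
# for f(x) = (x₁⁴ + x₂⁴)/4 + x₁x₂, with the gradient coded by hand and the
# explicit Hessian used only to check the answer.
∇f(x) = [x[1]^3 + x[2], x[2]^3 + x[1]]
H(x)  = [3x[1]^2  1.0;
         1.0      3x[2]^2]

x = [1.0, 2.0]
v = [0.5, -1.0]
ε = 1e-6
hvp = (∇f(x + ε*v) - ∇f(x)) / ε     # never forms the full Hessian
println(hvp, "  vs  ", H(x) * v)
```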
Further Readings (Part 2):
- A very general reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and generalizations thereof), using notation similar to that of Chris R. today, is “Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution”.
- See also the adjoint sensitivity analysis section of Chris’s amazing DifferentialEquations.jl software suite for numerical solution of ODEs in Julia.
- There is a nice YouTube lecture on Adjoint State Method for an ODE, again using a similar notation.
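To connect with those readings, here is a toy discrete-adjoint sketch in Julia: reverse-mode differentiation through an explicit-Euler solve of du/dt = p·u with loss G = u(T). The equation, the numbers, and the function name are made up, and this mirrors the idea rather than DifferentialEquations.jl’s actual adjoint algorithms.

```julia
# Explicit Euler for du/dt = p*u, u(0) = u0, followed by a reverse (adjoint)
# sweep that accumulates dG/dp for the loss G = u(T).
function euler_loss_and_gradient(p, u0, T, N)
    h = T / N
    u = zeros(N + 1); u[1] = u0
    for k in 1:N                     # forward pass: store the whole trajectory
        u[k+1] = u[k] + h * p * u[k]
    end
    λ, dGdp = 1.0, 0.0               # λ = ∂G/∂u at the current step (starts at u_N)
    for k in N:-1:1                  # reverse pass
        dGdp += λ * h * u[k]         # contribution of p at step k (u_k held fixed)
        λ *= 1 + h * p               # chain rule through u_{k+1} = (1 + h*p) u_k
    end
    return u[N+1], dGdp
end

p, u0, T, N = 0.7, 1.5, 1.0, 100
G, dGdp = euler_loss_and_gradient(p, u0, T, N)
h = T / N
println(dGdp, "  vs exact discrete  ", u0 * N * h * (1 + h*p)^(N - 1))
```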