Selected lecture notes are available.
Lecture 1
Outline
Part 1: Overview, applications, and motivation.
Part 2: Rethinking derivatives as linear operators: f(x + dx) - f(x) = df = f′(x)[dx] — f′ is the linear operator that gives the change df in the output from a “tiny” change dx in the inputs, to first order in dx (i.e. dropping higher-order terms). When we have a scalar function f(x) ∈ ℝ of vector inputs x ∈ ℝⁿ, then this gives us a “row vector” f′(x) since f′(x)dx is a scalar, which we interpret as the transpose of the gradient ∇f (which we call a “column” vector), i.e. df = (∇f) ⋅ dx = (∇f)ᵀdx. When we have a vector function f(x) ∈ ℝᵐ of vector inputs x ∈ ℝⁿ, then f′(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m × n matrix called the Jacobian matrix (typically covered only superficially in 18.02 Multivariable Calculus).
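A quick numerical illustration of this viewpoint (a toy sketch; the example function f(x) = ‖x‖²/2, for which ∇f = x, is my own choice and not from the notes): compare the actual change f(x + dx) − f(x) with the linearization (∇f)ᵀdx.

```julia
# Toy check that f(x+dx) − f(x) ≈ f′(x)[dx] = (∇f)ᵀ dx to first order in dx.
f(x)  = sum(abs2, x) / 2   # example scalar function of a vector input
∇f(x) = x                  # its gradient (same shape as x)

x  = randn(5)
dx = 1e-6 * randn(5)

println(f(x + dx) - f(x))  # actual change df
println(∇f(x)' * dx)       # linearization (∇f)ᵀ dx; agrees up to O(‖dx‖²) terms
```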
Lecture Notes
- Part 1: Introduction (PDF)
- Part 2: Derivatives as Linear Operators
Further Readings
- matrixcalculus.org is a fun site to play with derivatives of matrix and vector functions.
- The Matrix Cookbook (PDF) has a lot of formulas for these derivatives, but no derivations.
- Notes on Vector and Matrix Differentiation (PDF) from 6.S087 Special Subject in Electrical Engineering and Computer Science (IAP 2021) are helpful.
- Fancier math: The perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I’m calling “fancy”) presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The “little-o notation” o(δx) we’re using here for “infinitesimal asymptotics” is closely related to the Big O notation used in computer science, but in computer science people are typically taking the limit as the argument (often called “n”) becomes very large instead of very small. A fancy name for a row vector is a “covector” or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
Lecture 2
Outline
Part 1: Continued discussing derivatives as linear operators, starting with Jacobian matrices. Reviewed the sum rule d(f + g) = df + dg, the product rule d(fg) = (df)g + f(dg), and the chain rule for f(x) = g(h(x)): f′(x) = g′(h(x))h′(x), a composition of two linear operations, performing h′ first and then g′ (note that g′h′ ≠ h′g′!). For functions from vectors to vectors, the chain rule is simply the product of Jacobians. Moreover, as soon as you compose 3 or more functions, it can make a huge difference whether you multiply the Jacobians from left to right (“reverse-mode”, or “backpropagation”, or “adjoint” differentiation) or right to left (“forward-mode”). Showed, for example, that if you have many inputs but a single output (as is common in machine learning and other types of optimization problems), it is vastly more efficient to multiply left-to-right than right-to-left, and such “backpropagation” algorithms are a key factor in the practicality of large-scale optimization. Finally, began talking about functions in more general vector spaces, such as functions with matrix inputs and/or outputs. For example, considered f(A) = A³, giving d(A³) = f′(A)[dA] = A²(dA) + A(dA)A + (dA)A² (≠ 3A²dA!), and f(A) = A⁻¹, giving d(A⁻¹) = -A⁻¹(dA)A⁻¹.
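A quick way to convince yourself of the A³ formula is a brute-force numerical check (a sketch of my own, using a small random matrix): compare the exact change (A + dA)³ − A³ with the linear operator above and with the naive (incorrect) 3A²dA.

```julia
using LinearAlgebra

A  = randn(4, 4)
dA = 1e-6 * randn(4, 4)

exact  = (A + dA)^3 - A^3               # actual change in A³
linear = A^2*dA + A*dA*A + dA*A^2       # f′(A)[dA] from the product rule
naive  = 3 * A^2 * dA                   # wrong unless A and dA commute

println(norm(exact - linear) / norm(exact))  # tiny: only higher-order terms remain
println(norm(exact - naive)  / norm(exact))  # order-1 relative error
```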
Part 2: Began going into more detail on matrix-valued functions, and their relationship to the “Jacobian matrix” picture. Converting f′(A) to a conventional “Jacobian matrix” in such cases requires converting matrices A into column vectors vec(A), a process called “vectorization” of the matrix (by a common convention: simply stacking the columns of the matrix on top of one another). Linear operators like f′(A)[dA] = A dA + dA A can then be expressed as “ordinary” matrices acting on vec(dA) via Kronecker products.
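For instance (a sketch under my own conventions, for f(A) = A² so that f′(A)[dA] = A dA + dA A), the column-major vec identity vec(AXB) = (Bᵀ ⊗ A)vec(X) turns this linear operator into the n² × n² Jacobian I ⊗ A + Aᵀ ⊗ I:

```julia
using LinearAlgebra

n  = 3
A  = randn(n, n)
dA = randn(n, n)
Id = Matrix(1.0I, n, n)

J = kron(Id, A) + kron(transpose(A), Id)   # Jacobian of f(A) = A² in vectorized form

println(norm(J * vec(dA) - vec(A*dA + dA*A)))   # ≈ 0: J acts on vec(dA) as f′(A)[dA]
```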
Lecture Notes
- Part 0: Examples of Linear and Nonlinear Transformations of ℝ² via Images (try it online)
- Part 1: Derivatives as Linear Operators (continued)
- Part 2 (Matrix Jacobians via Vectorization): Two by Two Matrix Jacobians (HTML) (Pluto notebook source code)
Further Readings
- The terms “forward-mode” and “reverse-mode” differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course.
- You can find many, many articles online about backpropagation in neural networks.
- There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback.
- The video on Understanding Automatic Differentiation (in Julia) (YouTube) by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD.
Lecture 3
Outline
Part 1: Continued from Lecture 2: matrix functions, Jacobians, vectorizations, and Kronecker products. More examples of matrix functions, including LU factorization and 2 × 2 eigenproblems.
Part 2: Finite-difference methods: viewing f(x + δx) − f(x) as an approximation for f′(x)[δx] on a computer. This is extremely useful as a quick check of a hand-derived derivative (which is very error-prone for complicated functions), and can also be used as a replacement for analytical derivatives in a pinch. Analyzed two sources of error: truncation error (from the non-infinitesimal δx) and roundoff error (from the finite precision of computer arithmetic).
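The toy sketch below (my own example, f(x) = sin x at x = 1) shows both effects: the forward-difference error first shrinks roughly in proportion to δx (truncation error), then grows again once δx is so small that cancellation in f(x + δx) − f(x) dominates (roundoff error).

```julia
f(x)  = sin(x)
x     = 1.0
exact = cos(x)

for δx in 10.0 .^ (-1:-2:-13)
    fd = (f(x + δx) - f(x)) / δx     # forward-difference approximation of f′(x)
    println((δx, abs(fd - exact)))   # error: truncation-dominated, then roundoff-dominated
end
```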
Lecture Notes
- Part 1 (Lecture 2 Part 2 continued): Two by Two Matrix Jacobians (HTML) (Pluto notebook source code)
- Part 2: Finite Difference (Jupyter notebook)
Further Readings
- Wikipedia has a useful list of properties of the matrix trace.
- The “matrix dot product” introduced today is also called the Frobenius inner product, and the corresponding norm (“length” of the matrix viewed as a vector) is the Frobenius norm.
- When you “flatten” a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products.
- The Matrix Cookbook (PDF) has lots of formulas for derivatives of matrix functions.
- There is a lot of information online on finite-difference approximations; see e.g. the 18.303 Notes on Finite Differences (PDF) or Section 5.7 of Numerical Derivatives (PDF).
- The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
Lecture 4
Outline
Part 0: To begin with, spent a few minutes talking about the last few sections of the Finite Difference (Jupyter notebook) from last lecture: higher-order finite-difference rules and finite differences in higher dimensions (e.g. for gradients).
Part 1: Generalizing gradients to scalar functions f(x) for x in an arbitrary vector space V. The key thing is that we need not just a vector space, but an inner product x ⋅ y (a “dot product”, also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a Hilbert space. Then, for any scalar function, since df = f′(x)[dx] is a linear operator mapping dx ∈ V to scalars df ∈ ℝ (a “linear form”), it turns out that it must be a dot product of dx with “something”, and we call that “something” the gradient! That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f′(x)[dx] = ∇f ⋅ dx. So ∇f is always something with the same “shape” as x (the steepest-ascent direction).
Defined the most obvious inner product of m × n matrices: the Frobenius inner product A ⋅ B = sum(A .* B) = trace(AᵀB) = vec(A)ᵀvec(B), the sum of the products of the matrix entries. This also gives us the “Frobenius norm” ‖A‖ = √(A ⋅ A) = √(trace(AᵀA)) = ‖vec(A)‖, the square root of the sum of the squares of the entries. Using this, we can now take the derivatives of various scalar functions of matrices, e.g. we considered the examples below (a quick numerical check of the second one is sketched after the list):
- f(A) = ‖A‖ ⇒ ∇f = A/‖A‖
- f(A) = xᵀAy ⇒ ∇f = xyᵀ (for constant x, y)
- f(A) = det(A) ⇒ ∇f = det(A)(A⁻¹)ᵀ = adjugate(A)ᵀ: we will prove this later
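A minimal sketch of how such formulas can be checked numerically (my own toy check of the second example above): compare f(A + dA) − f(A) with the Frobenius inner product ∇f ⋅ dA = sum(∇f .* dA).

```julia
using LinearAlgebra

x, y = randn(4), randn(4)
f(A) = x' * A * y
∇f   = x * y'                 # claimed gradient: same shape as A

A  = randn(4, 4)
dA = 1e-6 * randn(4, 4)

println(f(A + dA) - f(A))     # change in f
println(sum(∇f .* dA))        # ∇f ⋅ dA (Frobenius); here they agree exactly, since f is linear in A
```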
Part 2: Applications of derivatives to multivariate root-finding and optimization. A key fact enabling large-scale optimization, i.e. min f(x) where f is a scalar function of many parameters x, is that computing ∇f has about the same cost as f, using what is variously called “reverse-mode” or “adjoint” or “backpropagation” differentiation algorithms, which essentially boil down to evaluating the chain rule left to right. Went through a few examples of this, oriented more at engineering/physics optimization (and “topology optimization”).
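As a rough sketch of this key fact (assuming the reverse-mode AD package Zygote.jl is installed; the "loss" function is an arbitrary example of mine), a single reverse pass produces the entire gradient with respect to thousands of inputs at a cost comparable to one evaluation of the loss itself:

```julia
using Zygote, LinearAlgebra

loss(x) = sum(sin.(x).^2) / 2            # scalar function of many parameters
x = randn(10_000)

g = Zygote.gradient(loss, x)[1]          # reverse mode: one backward pass gives all 10⁴ entries
println(norm(g - sin.(x) .* cos.(x)))    # ≈ 0: matches the analytic gradient
```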
Lecture Notes
- Part 1: Generalized Gradients and Inner Products
- Part 2: Nonlinear Root-Finding, Optimization, and Adjoint-Method Differentiation (PDF)
Further Readings (Part 2)
- Prof. Steven Johnson gave another overview of optimization problems (PDF) in 18.335 Introduction to Numerical Methods.
- There are many textbooks on nonlinear programming, including Nonlinear Programming (by Bertsekas) and specialized books on Convex Optimization (by Boyd and Vandenberghe), Introduction to Derivative-Free Optimization (by Conn, Scheinberg, and Vicente), etc.
- A useful review of topology-optimization methods can be found in Topology Optimization Approaches (by Sigmund and Maute).
- See the Notes on Adjoint Methods (PDF) and The Adjoint Method for Differentiating Complex Computations (PDF) from 18.335 Introduction to Numerical Methods.
Lecture 5
Lecture Notes
- Part 0: Norms and Derivatives: Why a Norm of the Input and Output Are Needed to Define a Derivative
- Part 1: Derivative of Matrix Determinant and Inverse (HTML) (Julia source code)
- Part 2: Forward-Mode Automatic Differentiation via Dual Numbers (Jupyter notebook)
- Part 3: Forward and Reverse-Mode Automatic Differentiation on Computational Graphs
Further Readings (Part 1)
- There are lots of discussions of the derivative of a determinant online, involving the “adjugate” matrix det(A)A⁻¹. Not as well documented is that the gradient of the determinant is the cofactor matrix widely used for the Laplace expansion of a determinant. The formula for the derivative of log(det A) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory).
- The Matrix Cookbook (PDF) contains many of these formulas, but no derivations.
- A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton’s method to det(A - λI) = 0, and more generally one can solve det(M(λ)) = 0 for any function M(λ) — the resulting roots λ are called nonlinear eigenvalues (if M is nonlinear in λ), and one can apply Newton’s method using the determinant-derivative formula here (a toy Newton iteration is sketched after this list).
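A toy implementation of that Newton idea (my own sketch; it assumes a reasonable starting guess and simple eigenvalues): using d(det M) = det(M) tr(M⁻¹ dM), the Newton step for g(λ) = det(A − λI) simplifies to λ ← λ + 1/tr((A − λI)⁻¹), because the determinants cancel.

```julia
using LinearAlgebra

# Newton's method on g(λ) = det(A − λI) = 0, using g′(λ) = −det(A − λI) tr((A − λI)⁻¹),
# so λ − g(λ)/g′(λ) = λ + 1/tr((A − λI)⁻¹).  20 iterations is plenty for this toy example.
function newton_eig(A, λ; iters = 20)
    for _ in 1:iters
        λ += 1 / tr(inv(A - λ*I))
    end
    return λ
end

A = Symmetric(randn(5, 5))
λ = newton_eig(A, 0.1)                   # 0.1 is an arbitrary starting guess
println(minimum(abs.(eigvals(A) .- λ)))  # ≈ 0: λ has converged to an eigenvalue of A
```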
Further Readings (Part 2)
- Googling “automatic differentiation” will turn up many, many resources—this is a huge practical field these days. ForwardDiff.jl (described in detail by “Forward-Mode Automatic Differentiation in Julia”) uses dual-number arithmetic similar to what we saw in lecture to compute derivatives (a minimal toy version is sketched after this list).
- See also the article “How to Differentiate with a Computer,” or google “dual number automatic differentiation” for many other reviews.
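To give the flavor of the dual-number idea, here is a deliberately minimal toy version (my own sketch, nothing like as complete or fast as ForwardDiff.jl): carry a (value, derivative) pair through each arithmetic operation, applying the sum and product rules, so that evaluating f on Dual(x, 1) yields both f(x) and f′(x).

```julia
struct Dual
    val::Float64   # value
    der::Float64   # derivative (coefficient of the "infinitesimal" part)
end

import Base: +, *
+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)              # sum rule
*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der*b.val + a.val*b.der)  # product rule
+(a::Dual, b::Real) = Dual(a.val + b, a.der)
+(a::Real, b::Dual) = b + a
*(a::Dual, b::Real) = Dual(a.val * b, a.der * b)
*(a::Real, b::Dual) = b * a
# (other operations: -, /, sin, exp, … are defined analogously)

f(x) = x*x*x + 2x + 1        # f′(x) = 3x² + 2
d = f(Dual(3.0, 1.0))        # seed the derivative with 1 at x = 3
println((d.val, d.der))      # (34.0, 29.0) = (f(3), f′(3))
```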
Further Readings (Part 3)
- See Backpropagation through Back Substitution with a Backslash (PDF) about backpropagation on graphs, this blog post on calculus on computational graphs for a gentle review, and Columbia University’s Course Notes on Computational Graphs, and Backpropagation (PDF) for a more formal approach.
- Implementing automatic reverse-mode AD is much more complicated than defining a new number type, unfortunately, and involves a lot more intricacies of compiler technology. See also Chris Rackauckas’s blog post on Engineering Trade-Offs in Automatic Differentiation: from TensorFlow and PyTorch to Jax and Julia, and Chris’s discussion post on AD limitations.
Lecture 6
Lecture Notes
- Part 1: An Introduction to (Local) Sensitivity Analysis for (Ordinary) Differential Equations (PDF) (Courtesy of Frank Schäfer. Used with permission.)
- Part 2: Calculus of Variations
Further Readings (Part 1)
- A classic reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and generalizations thereof), using notation similar to that used today, is Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution (PDF).
- See also the SciMLSensitivity.jl package for sensitivity analysis with Chris Rackauckas’s amazing DifferentialEquations.jl software suite for numerical solution of ODEs in Julia, along with his notes Differentiable Programming and Neural Differential Equations.
- There is a nice YouTube lecture on Adjoint State Method for an ODE, again using a similar notation.
- A discrete version of this process is adjoint methods and sensitivity analysis for recurrence relations (PDF) (MIT course notes), in which case one obtains a reverse-order “adjoint” recurrence relation.
Further Readings (Part 2)
- There are many resources on the “calculus of variations”, which refers to derivatives of f(u) = ∫F(u,u′,…)dx for functions u(x), but we saw that it is essentially just a special case of our general rule df = f(u + du) - f(u) = f′(u)[du] = ∇f ⋅ du when du lies in a vector space of functions (a short derivation sketch is given after this list). Setting ∇f = 0 to find an extremum of f(u) yields an Euler–Lagrange equation, the most famous examples of which are probably Lagrangian mechanics and the Brachistochrone problem, but it also shows up in many other contexts such as optimal control.
- A very readable textbook on the subject is Calculus of Variations by Gelfand and Fomin.
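Concretely, the Euler–Lagrange equation drops out of the general rule in a couple of lines; here is a LaTeX sketch under the usual assumptions (smooth F, u fixed at the endpoints so the boundary term from integration by parts vanishes):

```latex
\begin{align*}
df &= \int \left( \frac{\partial F}{\partial u}\, du + \frac{\partial F}{\partial u'}\, du' \right) dx
    = \int \left( \frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u'} \right) du \, dx
    = \nabla f \cdot du , \\
&\text{so } \nabla f = \frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u'} ,
 \text{ and } \nabla f = 0 \text{ is the Euler--Lagrange equation.}
\end{align*}
```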
Lecture 7
Lecture Notes
- Part 1: Derivatives of Random Functions (PDF) (Courtesy of Gaurav Arya. Used with permission.)
- Part 2: Second Derivatives, Bilinear Forms, and Hessian Matrices
Further Readings (Part 1)
- The idea of computing gradients of programs with a sampleable random output is called Monte-Carlo Gradient Estimation; the link leads to a nice survey.
- This Stack Exchange answer gives a concise, worked-out example of the reparameterization trick applied to a toy program (another toy sketch is given after this list).
- The frontier of simulation-based inference gives an overview of stochastic simulations across many domains of science, and discusses attempts to deal with the fact that it is much easier to sample them than to exactly compute a “likelihood.”
- Auto-Encoding Variational Bayes introduces the variational autoencoder along with the reparameterization trick used to train it.
- Automatic differentiation of programs with discrete randomness describes an approach for generalizing derivatives of continuous random functions based on the “reparameterization trick” to discrete random functions. StochasticAD.jl is an associated code package that implements the stochastic triples we played with at the end of class.
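To make the “reparameterization trick” concrete, here is a toy sketch (my own example, not the one from class): to differentiate E[z²] with respect to μ for z ~ Normal(μ, σ), write z = μ + σε with ε ~ Normal(0, 1), so the μ-dependence moves inside a fixed expectation over ε and the derivative can be estimated by ordinary Monte Carlo sampling.

```julia
μ, σ = 1.5, 0.7
N = 10^6
ε = randn(N)                             # samples of the "reparameterized" noise

# d/dμ E[(μ + σε)²] = E[2(μ + σε)], estimated by a sample mean:
grad_estimate = sum(2 .* (μ .+ σ .* ε)) / N

println((grad_estimate, 2μ))             # exact answer: d/dμ (μ² + σ²) = 2μ
```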
Further Readings (Part 2)
- Bilinear forms are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a symmetric bilinear form. This is closely related to a quadratic form, which is just what we get by plugging in the same vector twice, e.g. the f″(x)[δx,δx]/2 that appears in quadratic approximations for f(x + δx) is a quadratic form. The most familiar multivariate version of f″(x) is the Hessian matrix; Khan Academy has an elementary introduction to quadratic approximations.
- Positive-definite Hessian matrices, or more generally definite quadratic forms f″, appear at extrema (f′ = 0) of scalar-valued functions f(x) that are local minima; there are more formal treatments (PDF) of the same idea, and conversely Khan Academy has the simple 2-variable version where you can check the signs of the eigenvalues of the 2 × 2 Hessian just by looking at the determinant and a single entry (or the trace). There’s a nice Stack Exchange discussion on why an ill-conditioned Hessian tends to make steepest descent converge slowly; these Toronto course notes on optimization (PDF) may also be helpful.
- See e.g. these Stanford notes on Sequential Convex Programming (PDF) using trust regions (sec. 2.2). See also the 18.335 notes on Quasi-Newton Optimization: Origin of the BFGS Update (PDF). The fact that a quadratic optimization problem in a sphere has strong duality, and hence is efficiently solvable, is discussed in section 5.2.4 of Convex Optimization (by Boyd and Vandenberghe). There has been a lot of work on automatic Hessian computation, but for large-scale problems you can ultimately only compute Hessian–vector products efficiently in general; these are equivalent to a directional derivative of the gradient, and can be used e.g. for Newton–Krylov methods (a small sketch follows this list).
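A small sketch of the Hessian–vector product idea (assuming the ForwardDiff.jl package; the test function is an arbitrary choice of mine): since Hv is the directional derivative of ∇f along v, it can be computed by differentiating the gradient along the line x + tv, without ever forming the full Hessian.

```julia
using ForwardDiff, LinearAlgebra

f(x) = sum(abs2, x) / 2 + x[1] * x[2]^2   # arbitrary smooth test function
x = randn(4)
v = randn(4)

g(t) = ForwardDiff.gradient(f, x + t*v)   # gradient along the line x + t v
Hv   = ForwardDiff.derivative(g, 0.0)     # d/dt at t = 0, i.e. H(x) * v

println(norm(Hv - ForwardDiff.hessian(f, x) * v))  # ≈ 0 (the full H is formed only to check)
```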
Lecture 8
Lecture Notes
- Part 1: Derivatives of Eigenproblems (HTML) (Julia source code)
- Part 2: Forward and Reverse-Mode Automatic Differentiation on Computational Graphs (continued from Lecture 5) and interactive notebook (HTML)
- Part 3: some topics we didn’t cover
Further Readings (Part 1)
- Computing derivatives on curved surfaces (“manifolds”) is closely related to tangent spaces in differential geometry. The effect of constraints can also be expressed in terms of Lagrange multipliers, which are useful in expressing optimization problems with constraints (see also chapter 5 of Convex Optimization by Boyd and Vandenberghe). In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of “time-independent” perturbation theory in quantum mechanics, or as the Hellmann–Feynman theorem for the case of dλ (a quick numerical check of the dλ formula is sketched below). The derivative of an eigenvector involves all of the other eigenvectors, but a much simpler “vector–Jacobian product” (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a scalar function of an eigenvector, as reviewed in the Notes on Adjoint Methods for 18.335 (PDF).
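As a quick numerical check of the simplest such formula (my own toy example, a random symmetric matrix with distinct eigenvalues): if Av = λv with ‖v‖ = 1 and A symmetric, then dλ = vᵀ(dA)v to first order, which is the Hellmann–Feynman result.

```julia
using LinearAlgebra

A  = Symmetric(randn(5, 5))
dA = Symmetric(1e-6 * randn(5, 5))

λ, V = eigen(A)            # eigenvalues (ascending) and orthonormal eigenvectors
λ2   = eigvals(A + dA)     # eigenvalues of the perturbed matrix
v    = V[:, 1]             # eigenvector for the smallest eigenvalue

println(λ2[1] - λ[1])      # actual change in the smallest eigenvalue
println(v' * dA * v)       # first-order prediction dλ = vᵀ (dA) v
```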
Further Readings (Part 2)
- See Lecture 5, above.
Further Readings (Part 3)
There are many topics that we did not have time to cover, even in 16 hours of lectures. If you came into this class thinking that taking derivatives is easy and you already learned everything there is to know about it in first-year calculus, hopefully we’ve convinced you that it is an enormously rich subject that is impossible to exhaust in a single course. Some of the things it might have been nice to include are:
- When AD hits something it can’t handle, you may have to write a custom Jacobian–vector product (a “Jvp”, “frule”, or “pushforward”) in forward mode, and/or a custom rowvector–Jacobian product (a “vJp”, “rrule”, “pullback”, or Jacobianᵀ–vector product) in reverse mode. In Julia with Zygote AD, this is done using the ChainRules packages; in Python with JAX, this is done with `jax.custom_jvp` and/or `jax.custom_vjp`, respectively. In principle this is straightforward, but the APIs can take some getting used to because of the generality that they support (a minimal ChainRules-style sketch is given at the end of this list).
- For functions f(z) of complex arguments z (complex vector spaces), you cannot take “ordinary” complex derivatives whenever the function involves the conjugate z̄ (e.g. for |z|, Re z, or Im z); this must occur if f(z) is purely real-valued (and not constant). One option is to write z = x + iy and treat f(z) as a two-argument function f(x,y) with real derivatives, but this can be awkward if your problem is “naturally” expressed in terms of complex variables (e.g. in the Fourier frequency domain). A common alternative is “CR calculus” (or “Wirtinger calculus”), in which you write df = (∂f/∂z)dz + (∂f/∂z̄)dz̄ as if z and z̄ were independent variables. This can be extended to gradients, Jacobians, steepest-descent, and Newton iterations, for example. A nice review can be found in The Complex Gradient Operator and the CR-Calculus by Ken Kreuz-Delgado.
- Many, many more derivative results for matrix functions and factorizations can be found in the literature, some of them quite tricky to derive. For example, a number of references are listed in this GitHub issue for the ChainRules package.
- Another important generalization of differential calculus is to derivatives on curved manifolds and differential geometry, leading to the exterior derivative.
- When differentiating eigenvalues λ of matrices A(x), a complication arises at eigenvalue crossings (multiplicity k > 1), where in general the eigenvalues (and eigenvectors) cease to be differentiable. (More generally, this problem arises for any implicit function with a repeated root.) In this case, one option is to use an expanded definition of sensitivity analysis called a generalized gradient (a k × k matrix-valued linear operator G(x)[dx] whose eigenvalues are the perturbations dλ); see e.g. Cox (1995), Seyranian et al. (1994), and Stechlinski (2022). (Physicists call this idea degenerate perturbation theory (PDF).) A recent formulation of similar ideas is called a lexicographic directional derivative; see Nesterov (2005) and Barton et al. (2017). Sometimes, optimization problems involving eigenvalues can be reformulated to avoid this difficulty by using SDP constraints (Men et al. 2014). For a defective matrix the situation is worse: even the generalized derivatives blow up, because dλ is proportional to the square root of the perturbation ‖dA‖.
- Famous generalizations of differentiation are the “distributional” and “weak” derivatives, for example to obtain Dirac delta “functions” by differentiating discontinuities. This requires changing not only the definition of “derivative” but also changing the definition of “function,” as reviewed in When Functions Have No Value(s): Delta Functions and Distributions (PDF).
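As an illustration of what a custom reverse-mode rule (referenced in the first bullet above) looks like in the Julia ecosystem, here is a minimal sketch assuming the ChainRulesCore.jl package; `mysquare` is a made-up example function, not a real API. An `rrule` returns the primal value together with a pullback that maps the output cotangent to input cotangents.

```julia
using ChainRulesCore

mysquare(x::Real) = x^2   # hypothetical function we want to give a custom reverse rule

function ChainRulesCore.rrule(::typeof(mysquare), x::Real)
    y = mysquare(x)
    # pullback: multiply the output cotangent ȳ by the "row-vector Jacobian" 2x
    mysquare_pullback(ȳ) = (NoTangent(), 2x * ȳ)
    return y, mysquare_pullback
end

# With this rule defined, a reverse-mode AD package built on ChainRules (e.g. Zygote)
# will call mysquare_pullback instead of differentiating through x^2 itself.
```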