18.S096 | January IAP 2022 | Undergraduate
Matrix Calculus for Machine Learning and Beyond

### Lecture Notes

• Part 1: Introduction (PDF)
• Part 2: Derivatives as Linear Operators [notes not available]

• matrixcalculus.org is a fun site to play with derivatives of matrix and vector functions.
• The Matrix Cookbook (PDF) has a lot of formulas for these derivatives, but no derivations.
• Notes on Vector and Matrix Differentiation (PDF) are helpful.
• Fancier math: the perspective of derivatives as linear operators is sometimes called a Fréchet derivative, and you can find lots of very abstract (what I’m calling “fancy”) presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces.
• The “little-o notation” o(δx) we’re using here for “infinitesimal asymptotics” is closely related to the Big O notation used in computer science, but in computer science people are typically taking the limit as the argument (often called “n”) becomes very large instead of very small.
• A fancy name for a row vector is a “covector” or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
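The linear-operator viewpoint is easy to check numerically. Here is a minimal sketch (not from the notes; the test function f(A) = A² and all names are illustrative choices): the derivative of f(A) = A² is the linear operator dA ↦ A dA + dA A, and the remainder f(A+δA) − f(A) − f′(A)[δA] should shrink like o(‖δA‖) — in fact like O(‖δA‖²):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
dA = rng.standard_normal((4, 4))

f = lambda A: A @ A                  # example: f(A) = A^2
df = lambda A, dA: A @ dA + dA @ A   # its derivative, a linear operator acting on dA

# remainder f(A+eps*dA) - f(A) - f'(A)[eps*dA] is exactly (eps*dA)^2 here,
# so its norm shrinks like eps^2: the o(δx) term in the definition
for eps in (1e-2, 1e-3, 1e-4):
    err = np.linalg.norm(f(A + eps * dA) - f(A) - df(A, eps * dA))
    print(eps, err)
```

Note that f′(A)[δA] is A δA + δA A, not 2A δA: matrices don’t commute, which is exactly why the derivative must be kept as a linear operator rather than a single “scalar-style” factor.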

### Lecture Notes

• The terms “forward-mode” and “reverse-mode” differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course.
• You can find many, many articles online about backpropagation in neural networks.
• There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback.
• This video on “Understanding Automatic Differentiation” by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD.
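To make the forward/reverse distinction concrete, here is a small hand-written sketch (illustrative, not from the lecture; the function f(x) = Σ sin(Wx) and all names are assumptions). Forward mode pushes one perturbation dx right-to-left through each Jacobian and yields a single directional derivative; reverse mode pulls the output covector left-to-right and yields the entire gradient in one pass:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
dx = rng.standard_normal(5)

# f(x) = sum(sin(W x)); the Jacobian chain is  1^T · diag(cos(Wx)) · W
y = W @ x

# forward mode: propagate dx right-to-left (one pass per input direction)
dy = W @ dx
df_forward = np.sum(np.cos(y) * dy)

# reverse mode: propagate the output covector left-to-right
# (one pass gives the whole gradient, i.e. backpropagation)
ybar = np.ones(3)            # d(sum)/dz is the all-ones row vector
grad = W.T @ (np.cos(y) * ybar)

# the two orderings must agree on the directional derivative grad·dx
print(df_forward, grad @ dx)
```

The cost asymmetry is the point: for a scalar output with n inputs, forward mode would need n passes (one per direction) while the single reverse pass above already produces all n gradient components.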

### Lecture Notes

• Computing derivatives on curved surfaces (“manifolds”) is closely related to tangent spaces in differential geometry. The effect of constraints can also be expressed in terms of Lagrange multipliers, which are useful in expressing optimization problems with constraints (see also Chapter 5 of Convex Optimization by Boyd and Vandenberghe).
• In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of “time-independent” perturbation theory in quantum mechanics, or as the Hellmann–Feynman theorem for the case of dλ.
• The derivative of an eigenvector involves all of the other eigenvectors, but a much simpler “vector-Jacobian product” (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a scalar function of an eigenvector, as reviewed in the 18.335 Notes on Adjoint Methods (PDF).
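The Hellmann–Feynman result for a simple eigenvalue of a symmetric matrix, dλ = vᵀ(dA)v for a normalized eigenvector v, is easy to verify numerically. A minimal sketch (illustrative; the matrices and step size are arbitrary choices, not from the notes) comparing it against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5)); A = (B + B.T) / 2       # random symmetric A
dB = rng.standard_normal((5, 5)); dA = (dB + dB.T) / 2   # symmetric perturbation

lam, V = np.linalg.eigh(A)
v = V[:, 0]                        # normalized eigenvector of the smallest eigenvalue

eps = 1e-6
lam_pert = np.linalg.eigh(A + eps * dA)[0][0]
dlam_fd = (lam_pert - lam[0]) / eps   # finite-difference estimate of dλ/dε
dlam_hf = v @ dA @ v                  # Hellmann–Feynman: dλ = vᵀ (dA) v
print(dlam_fd, dlam_hf)
```

Note the contrast with the bullet above: dλ needs only the one eigenvector v, while dv would drag in all the other eigenvectors (and would break down for degenerate eigenvalues, where the formula above also requires care).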

• Bilinear forms are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a symmetric bilinear form. This is closely related to a quadratic form, which is just what we get by plugging in the same vector twice, e.g. the f’’(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form.
• The most familiar multivariate version of f’’(x) is the Hessian matrix.
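The quadratic approximation f(x+δx) ≈ f(x) + f’(x)[δx] + f’’(x)[δx,δx]/2, with the quadratic form written as δxᵀHδx for the Hessian H, can be checked numerically. A small sketch (the test function f(x) = Σ xᵢ⁴, with gradient 4x³ and Hessian diag(12x²), is an illustrative choice, not from the notes); the remainder should shrink like O(‖δx‖³):

```python
import numpy as np

f = lambda x: np.sum(x**4)
grad = lambda x: 4 * x**3
hess = lambda x: np.diag(12 * x**2)   # Hessian: symmetric bilinear form u^T H v

rng = np.random.default_rng(3)
x = rng.standard_normal(4)
dx = rng.standard_normal(4)

# quadratic model: f(x) + grad·δx + (1/2) δx^T H δx, with δx = eps*dx
for eps in (1e-1, 1e-2):
    d = eps * dx
    quad = f(x) + grad(x) @ d + 0.5 * (d @ hess(x) @ d)
    print(eps, abs(f(x + d) - quad))   # remainder shrinks like eps^3
```

Plugging the same vector δx into both slots of the bilinear form f’’(x)[·,·] is exactly the quadratic-form specialization described in the bullet above.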