alt.hn

3/29/2025 at 8:00:33 PM

Matrix Calculus (For Machine Learning and Beyond)

https://arxiv.org/abs/2501.14787

by ibobev

3/29/2025 at 11:19:30 PM

If you want to get handy with matrix calculus, the real prerequisite is being comfortable with Taylor expansions and linear algebra.

In a graduate numerical optimization class I took over a decade ago, the professor spent 10 minutes on the first day deriving some matrix calculus identity by working out the expressions for partial derivatives using simple calculus rules and a lot of manual labor. Then, as the class was winding up, he joked and said "just kidding, don't do that... here's how we can do this with a Taylor expansion", and proceeded to derive the same identity in what felt like 30 seconds.

Also, don't forget the Jacobian and gradient aren't the same thing!

by sfpotter

3/30/2025 at 1:17:55 AM

> Also, don't forget the Jacobian and gradient aren't the same thing!

Every gradient is a Jacobian but not every Jacobian is a gradient.

If you have a map f from R^n to R^m then the Jacobian at a point x is an m x n matrix which linearly approximates f at x. If m = 1 (namely if f is a scalar function) then the Jacobian is exactly the gradient.

If you already know about gradients (e.g. from physics or ML) and can't quite wrap your head around the Jacobian, the following might help (it's how I first got to understand Jacobians better):

1. write your function f from R^n to R^m as m scalar functions f_1, ..., f_m, namely f(x) = (f_1(x), ..., f_m(x))

2. take the gradient of f_i for each i

3. make an m x n matrix where the i-th row is the gradient of f_i

The matrix you build in step 3 is precisely the Jacobian. This is obvious if you know the definition, and it's not a mathematically remarkable fact, but for me at least it was useful for demystifying the whole thing.
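
A quick numerical illustration of steps 1-3 (a rough sketch in Python, with a made-up f from R^3 to R^2 and finite-difference gradients; names here are purely illustrative):

    import numpy as np

    def num_grad(g, x, h=1e-6):
        # central-difference estimate of the gradient of a scalar function g at x
        return np.array([(g(x + h * e) - g(x - h * e)) / (2 * h) for e in np.eye(len(x))])

    # a made-up map f: R^3 -> R^2, written as two scalar functions f_1, f_2
    f1 = lambda x: x[0] * x[1] + np.sin(x[2])
    f2 = lambda x: x[0] ** 2 - x[2]

    x = np.array([1.0, 2.0, 0.5])
    J = np.vstack([num_grad(f1, x), num_grad(f2, x)])   # i-th row = grad f_i, shape (2, 3)
    print(J)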

by C-x_C-f

3/30/2025 at 2:17:44 AM

For m = 1, the gradient is a "vector" (a column vector). The Jacobian is a functional/a linear map (a row vector, dual to a column vector). They're transposes of one another. For m > 1, I would normally just define the Jacobian as a linear map in the usual way and define the gradient to be its transpose. Remember that these are all just definitions at the end of the day and a little bit arbitrary.

by sfpotter

3/30/2025 at 1:15:36 PM

I'd say a gradient is usually a covector / one-form: it's a map from a direction vector to a scalar change. I.e., df = f_x dx + f_y dy is what you can actually compute without a metric; it lives in T*M, not TM. Given a direction vector (e.g. 2 d/dx), it takes you from there to a scalar.

by oddthink

3/30/2025 at 3:00:06 PM

I'm not a big Riemannian geometry buff, but I took a look at the definition in Do Carmo's book and it appears that "grad f" actually lies in TM, consistent with what I said above. Would love to learn more if I've got this mixed up.

This would be nice, because it would generalize the "gradient" from vector calculus, which is clearly and unambiguously a vector.

by sfpotter

3/30/2025 at 7:25:35 PM

It's probably just a notation/definition issue. I'm not sure "grad f" is 100% consistently defined.

I'm a simple-minded physicist. I just know if you apply the same coordinate transformation to the gradient and to the displacement vector, you get the wrong answer.

My usual reference is Schutz's Geometrical Methods of Mathematical Physics, and he defines the gradient as df, but other sources call that the "differential" and say the gradient is what you get if you use the metric to raise the indices of df.

But that raised-index gradient (i.e. g(df)) is weird and non-physical. It doesn't behave properly under coordinate transformations. So I'm not sure why folks use that definition.

You can see the difference by looking at the differential in polar coordinates. If you have f=x+y, then df=dx+dy=(cos th + sin th)dr + r(cos th - sin th)d th. If you pretend this is instead a vector and transform it, you'd get "df"=(cos th + sin th)dr + (1/r)(cos th - sin th)d th, which just gives the wrong answer.

To be specific, if v=(1,1) in cartesian (ex,ey), then df(v)=2. But (1,1) in cartesian is (1,1/r) in polar (er, etheta). The "proper" df still gives 2, but the "weird metric one" gives 1+1/r^2, since you get the 1/r factor twice, instead of a 1/r and a balancing r.
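
If anyone wants to check the polar computation symbolically, here is a sketch of it in sympy (the components of v in the polar coordinate basis reduce to (1, 1/r) at theta = 0, matching the numbers above):

    import sympy as sp

    r, th = sp.symbols('r theta', positive=True)
    f = r*sp.cos(th) + r*sp.sin(th)                 # f = x + y in polar coordinates

    # the one-form df in the (dr, dtheta) basis
    df = sp.Matrix([sp.diff(f, r), sp.diff(f, th)])

    # v = d/dx + d/dy in the (d/dr, d/dtheta) basis; at theta = 0 this is (1, 1/r)
    v = sp.Matrix([sp.cos(th) + sp.sin(th), (sp.cos(th) - sp.sin(th)) / r])

    # pairing the one-form with the vector needs no metric and is coordinate-free
    print(sp.simplify(df.dot(v)))                   # -> 2, same as in Cartesian

    # "raised-index gradient": apply the inverse metric diag(1, 1/r**2) to df
    grad = sp.Matrix([df[0], df[1] / r**2])

    # naively dotting its components with v (forgetting the metric) goes wrong
    print(sp.simplify(grad.dot(v).subs(th, 0)))     # -> 1 + 1/r**2, not 2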

by oddthink

3/31/2025 at 6:27:35 PM

And I'm just a simple applied mathematician. For me, the gradient is the vector that points in the direction of steepest increase of a scalar field, and the Jacobian (or indeed, "differential") is the linear map in the Taylor expansion. I'll be curious to take a look at your reference: looks like a good one, and I'm definitely interested in seeing what the physicist's perspective is. Thanks!

by sfpotter

3/30/2025 at 12:40:40 AM

Can you give an example?

by edflsafoiewq

3/30/2025 at 6:40:05 AM

If you mean an example of how to use Taylor expansions and linear algebra, here's one I just made up.

Let's say I want to differentiate tr(X^T X), where tr is the trace, X is a matrix, and X^T is its transpose. Expand:

    tr((X + dX)^T (X + dX)) = tr(X^T X) + 2 tr(X^T dX) + tr(dX^T dX).

Our knowledge of linear algebra tells us that tr is a linear map. Hence, dX -> 2 tr(X^T dX) is the linear mapping corresponding to the Jacobian of tr(X^T X). With a little more work we could figure out how to write it as a matrix.
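
A quick numerical sanity check of that linearization (a sketch with a random X and a small perturbation dX; written as a matrix, the gradient works out to 2X):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 3))
    dX = 1e-6 * rng.standard_normal((4, 3))   # small perturbation

    f = lambda A: np.trace(A.T @ A)

    # finite change vs. the linear term 2 tr(X^T dX)
    print(f(X + dX) - f(X), 2 * np.trace(X.T @ dX))   # agree up to O(||dX||^2)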

by sfpotter

3/30/2025 at 6:21:35 AM

  https://math.stackexchange.com/questions/3680708/what-is-the-difference-between-the-jacobian-hessian-and-the-gradient

  https://carmencincotti.com/2022-08-15/the-jacobian-vs-the-hessian-vs-the-gradient/

by godelski

3/30/2025 at 5:59:59 AM

Check out this classic from 3b1b - How (and why) to raise e to the power of a matrix: https://youtu.be/O85OWBJ2ayo

by vismit2000

3/30/2025 at 10:33:51 AM

For those who prefer reading (I’ve not seen the video, but it seems related):

https://sassafras13.github.io/MatrixExps/

“Thanks to a fabulous video by 3Blue1Brown [1], I am going to present some of the basic concepts behind matrix exponentials and why they are useful in robotics when we are writing down the kinematics and dynamics of a robot.”

by kgwgk

3/30/2025 at 3:41:13 PM

They didn't show how to actually do it using matrix decomposition!
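
For anyone curious, here is a minimal sketch of the diagonalization route (assuming the matrix is diagonalizable; scipy.linalg.expm covers the general case):

    import numpy as np

    def expm_via_eig(A):
        # exp(A) = V exp(D) V^{-1} when A = V D V^{-1} is diagonalizable
        w, V = np.linalg.eig(A)
        return (V * np.exp(w)) @ np.linalg.inv(V)

    A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # generator of 2D rotations
    print(expm_via_eig(A).real)               # ~ [[cos 1, sin 1], [-sin 1, cos 1]]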

by esafak

3/29/2025 at 9:28:15 PM

Those looking for a shorter primer could consult https://arxiv.org/abs/1802.01528

by i_am_proteus

3/30/2025 at 3:28:54 AM

I've only skimmed through both of them, so I might be entirely wrong here, but isn't the essential approach a bit different in each? The MIT one emphasizes not viewing matrices as tables of entries, but instead treating them as holistic mathematical objects. So when they take derivatives, they try to avoid the "element-wise" approach of differentiation, while the one by Parr and Howard seems to follow the "element-wise" approach, although with some shortcuts.

by Valk3_

3/30/2025 at 6:40:26 AM

I got the same impression as you: the Bright, Edelman, and Johnson (MIT) notes seem more driven by mathematicians, whereas I find the Parr and Howard paper wanting. Though I do agree with them that

  >  Note that you do not need to understand this material before you start learning to train and use deep learning in practice

I have an alternative version:

  > You don't need to know math to train good models, but you do need to know math to know why your models are wrong.

Referencing "All models are wrong".

I think another part is that the Bright, Edelman, and Johnson paper is also introducing concepts such as Automatic Differentiation, Root Finding, Finite Difference Methods, and ODEs. With that in mind, it is far more important to come at it from an approach where you are understanding the structures.

I think there is an odd pushback against math in the ML world (I'm an ML researcher). Mostly because it is hard, and there's a lot of success you can gain without it. But I don't think that should discourage people from learning math. And frankly, the math is extremely useful. If we're ever going to understand these models, we're going to need to do a fuck ton more math. So best to get started sooner rather than later (if that's anyone's personal goal, anyway).

by godelski

3/30/2025 at 10:47:51 AM

Regarding the math in ML, what I would love to see (links if you have any) is a nuanced take on the matter, showing examples from both sides: a good-faith discussion of what contributions one can make with and without a strong math background in the ML world.

edit: On the math side I've encountered one take that seemed unique, as I haven't seen anything like it elsewhere: https://irregular-rhomboid.github.io/2022/12/07/applied-math.... However, it only points out the courses from his math education that he thinks are relevant to ML; each course gets a very short description and/or motivation as to its usefulness for ML.

I like these concluding remarks:

Through my curriculum, I learned about a broad variety of subjects that provide useful ideas and intuitions when applied to ML. Arguably the most valuable thing I got out of it is a rough map of mathematics that I can use to navigate and learn more advanced topics on my own.

Having already been exposed to these ideas, I wasn’t confused when I encountered them in ML papers. Rather, I could leverage them to get intuition about the ML part.

Strictly speaking, the only math that is actually needed for ML is real analysis, linear algebra, probability and optimization. And even there, your mileage may vary. Everything else is helpful, because it provides additional language and intuition. But if you’re trying to tackle hard problems like alignment or actually getting a grasp on what large neural nets actually do, you need all the intuition you can get. If you’re already confused about the simple cases, you have no hope of deconfusing the complex ones.

by Valk3_

3/31/2025 at 5:48:22 AM

I think that author's next article does a great job explaining it. While they don't say it in these words, I'd say that by learning math you are able to speak the same language as the model. This does wonders for interpreting what is going on and why it is making certain decisions. The "black box" isn't transparent, but neither is it completely dark. And the space is dark enough that you should be trying to shed any light on it that you can.

https://irregular-rhomboid.github.io/2022/12/27/math-is-a-la...

by godelski

3/29/2025 at 8:24:59 PM

> The class involved numerous example numerical computations using the Julia language, which you can install on your own computer following these instructions. The material for this class is also located on GitHub at https://github.com/mitmath/matrixcalc

by westurner

3/29/2025 at 9:15:09 PM

The Matrix Cookbook [1] can be handy when learning this topic.

[1] https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

by dandanua

3/30/2025 at 3:34:23 PM

I think this encourages "look it up" and "this layout is just a convention" when one convention is much more "natural": at a point x, the dual element df(x) acts on vectors y via df(x)(y) = <grad f(x), y>.

I wouldn't teach it this way, but I would definitely take the Taylor expansion and define the grad vector as the one that makes the best local linear approximation. This tells you that the gradient lives in the same space, i.e. has the same dimensions.

Of course, you can always switch things around if you want to calculate things and put in transposes where they belong. But I find it insane to take the unnatural viewpoint as standard, which I find a lot of papers do.
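
As a small illustration of that pairing, using a made-up quadratic f(x) = x^T A x, whose gradient under this convention is (A + A^T) x:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    f = lambda x: x @ A @ x                  # scalar field f(x) = x^T A x

    x = rng.standard_normal(3)
    g = (A + A.T) @ x                        # gradient: lives in R^3, same space as x

    y = 1e-6 * rng.standard_normal(3)        # small displacement
    print(f(x + y) - f(x), g @ y)            # df(x)(y) = <grad f(x), y> to first order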

by refusingsalt

3/30/2025 at 8:43:01 AM

When I worked at the university, this used to be my go-to reference about matrix identities (including matrix calculus).

by hasley

3/29/2025 at 9:53:23 PM

Great course. I highly recommend that anyone interested in this topic check it out on the MIT website; it's taught by the same authors, and they are great lecturers.

by windsignaling

3/29/2025 at 11:35:08 PM

I have skimmed it, and it looks very good. It is actually not solely about matrix calculus, but shows a practical approach to differentiation in different vector spaces with many examples and intuitions.

by Koncopd

3/29/2025 at 10:39:19 PM

What does calculus mean?

by revskill

3/30/2025 at 2:45:53 PM

Calculus is the branch of mathematics that deals with continuous change. Broadly speaking there are two parts to it: differential calculus, which deals with rates of change, and integral calculus, which deals with areas, volumes and that sort of thing. Pretty early on you learn that these are essentially two sides of the same coin.

So in this particular instance, since we are talking about matrix calculus, it's a type of multivariable calculus where you're dealing with functions which take matrices as inputs and "matrix-valued functions" (i.e. functions which return matrices as output).

Calculus is used for a lot of things, but for example, if you have a differentiable function you can use calculus to find maxima and minima, inflection points, etc. Since the focus here is machine learning, one of the most important things you want to be able to do is gradient descent: optimising some cost function by tweaking the weights of your model. The gradient here is a vector field which, at every point in the parameter space of your model, points in the direction of steepest ascent[1]. So if you want to go down, you take the gradient at a particular point and step in exactly the opposite direction. That means you know which small tweak to the model weights will decrease your cost function the fastest at each step of training.

[1] To imagine the gradient, think of an old-school contour map like you’re going to do a hike or something. This is a “scalar field” (a map from spatial coordinates to a scalar value - altitude). The contour lines on the map link points of equal altitude. The gradient is a “vector field” which is a map from spatial coordinates to a vector. So imagine at every point on your contour map there was a little arrow that pointed in the direction of steepest ascent. Because this is matrix calculus you will be dealing with “matrix fields” (maps from spatial coordinates to matrices) as well. So for example say you did a measurement of stress in a steel beam. At each point you would have a “stress tensor” which says what the forces are at that point and in which direction they are pointing. This is a “tensor field” (map from spatial coordinates to a tensor) and a tensor is like a multidimensional matrix but with some additional rules about how it transforms.
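
To make the gradient-descent part concrete, here is a minimal sketch on a toy least-squares cost (made-up numbers, purely illustrative):

    import numpy as np

    # toy cost: least-squares fit of weights w, C(w) = ||A w - b||^2
    A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    cost = lambda w: np.sum((A @ w - b) ** 2)
    grad = lambda w: 2 * A.T @ (A @ w - b)   # the gradient points uphill

    w, lr = np.zeros(2), 0.05
    for _ in range(500):
        w -= lr * grad(w)                    # step in the opposite direction
    print(w, cost(w))                        # close to the least-squares optimum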

by seanhunter

3/29/2025 at 10:40:19 PM

Time to schedule with a dentist

by odyssey7

3/30/2025 at 4:53:33 AM

Wait, what: another math textbook recommendation by academics. ML and LLMs are arts of tinkering, not academic subjects.

Though Steven Johnson is the real deal and writes lots of code, Edelman is a shyster/imposter who used to ride the coattails of G. Strang and now shills for Julia, where he makes most of his money. You don't need textbooks, and you won't understand ML/LLMs by reading them.

1. If you want to have a little fun with ML/LLMs, fire up Google Colab and run one of the tutorials on the web: Karpathy, Hugging Face, or PyTorch examples.

2. If you don't want to do, but just want to read for fun, the Parr & Howard essay recommended by someone else here is much shorter and more succinct; this link renders better: https://explained.ai/matrix-calculus/

3. If you insist on academic textbooks, Boyd & Vandenberghe skips calculus and has more applications (engineering). Unfortunately, the code examples are in Julia! https://web.stanford.edu/~boyd/vmls/vmls.pdf https://web.stanford.edu/~boyd/vmls/ (link to a Python version).

4. If you want to become a tensor & differential programming ninja, learn Jax and XLA: https://docs.jax.dev/en/latest/quickstart.html https://colab.research.google.com/github/exoplanet-dev/jaxop...
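
As a taste, a minimal Jax sketch tying back to the matrix-calculus example upthread (jax.grad recovers the gradient 2X of tr(X^T X) automatically):

    import jax
    import jax.numpy as jnp

    f = lambda X: jnp.trace(X.T @ X)   # the example from upthread

    X = jnp.arange(6.0).reshape(3, 2)
    print(jax.grad(f)(X))              # equals 2*X, no hand derivation needed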

by FilosofumRex