6/4/2026 at 11:42:00 PM
Hint for authors: when discussing linear algebra (or really most other kinds of math), follow normal conventions. In this case, the convention would be that - (the minus sign) means subtraction. It does not mean "and also", especially when you sandwich it between two variables that represent matrices.
I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)
I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices along the line of Q-K=V.
It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
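To make the "query is a guess at the value" picture concrete, here is a minimal single-head sketch, assuming K=V means that one projection is used for both keys and values. The names `kv_tied_attention`, `W_q`, and `W_kv` are hypothetical, the shapes are arbitrary, and the causal mask is left out; this is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kv_tied_attention(X, W_q, W_kv):
    """Single-head attention where keys and values are the same matrix (assumed K=V setup)."""
    Q = X @ W_q                                # queries: each row is a "guess"
    KV = X @ W_kv                              # one projection, used as both K and V
    scores = Q @ KV.T / np.sqrt(Q.shape[-1])   # similarity of each guess to each candidate
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ KV, weights               # softmax-weighted mix of the same vectors

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 8, 8                   # illustrative sizes only
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_kv = rng.normal(size=(d_model, d_head))
out, weights = kv_tied_attention(X, W_q, W_kv)
```

Each output row is a convex combination of the rows of `KV`, weighted toward the rows that have the largest dot product with the query, which is roughly the "return the value closest to the guess" reading.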
by amluto
6/5/2026 at 2:23:32 AM
It confused me too.
An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).
by kanbankaren
6/5/2026 at 12:21:42 AM
> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence length for the Q-K=V model. While limited to a tight n=3 sample of lengths 512, 1024, and 2048, the degradation decreases from 5.4% to 2.2% as context is increased, suggesting that it is unlikely shorter sequences are the reason K=V performs acceptably.
by amemi
6/4/2026 at 11:48:37 PM
Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at a bigger scale, but unironically I cannot afford the hardware lol.
by xiaoyu2006
6/5/2026 at 7:15:42 PM
I think the primary reason it works is that the difference between K and Q, which is not all that obvious, is that it allows the model to have an asymmetric relationship between tokens, so one token can attend to another without the reverse being true. It seems to me that if you have just a single shared representation, you're representing a symmetric relationship, which might degrade the quality of reasoning over a set of tokens, but is probably still workable.
by joshuamoyers
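The symmetry point can be checked numerically. Below is a small sketch, assuming random weights and made-up shapes (all names are hypothetical): when queries and keys come from the same projection, the pre-softmax score matrix equals its own transpose, and with separate projections it generally does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 8, 4                  # illustrative sizes only
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

# Separate projections: score[i, j] and score[j, i] are unrelated in general.
scores_untied = (X @ W_q) @ (X @ W_k).T

# Shared projection (the Q=K case): scores have the form A @ A.T, which is symmetric.
scores_tied = (X @ W_q) @ (X @ W_q).T

untied_is_symmetric = np.allclose(scores_untied, scores_untied.T)
tied_is_symmetric = np.allclose(scores_tied, scores_tied.T)
```

Note that this only concerns the raw scores. The row-wise softmax, and a causal mask if one is used, would still make the final attention weights asymmetric even when the scores are symmetric.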
6/5/2026 at 7:18:35 PM
It seems similar to the class of optimizations associated with linear or state-space attention. What models often do once an optimization like this is figured out is use a ratio of full-resolution blocks to blocks that have the optimization implemented.
by joshuamoyers
6/5/2026 at 5:36:28 AM
Would it have killed them to use a comma instead?!
by ssivark
6/5/2026 at 4:58:42 AM
Wha? Why didn't they use Q=K=V for that?
by sfink
6/5/2026 at 12:42:05 PM
The notation is supposed to mean: you have a matrix Q, and also a shared K=V matrix.
I agree with GP that it's super confusing to use the minus sign as a delimiter between formulas. The tuple notation suggested elsewhere would be way clearer.
by simsla
6/5/2026 at 2:50:49 AM
It's not a math paper
by semiinfinitely
6/5/2026 at 4:50:45 AM
Does it not being an English philology paper mean they are free to spell “fish” as “ghoti”?
by volemo
6/5/2026 at 5:23:39 AM
Definitely an applied maths paper given that it has been published under CS/ML and been accepted at ICML.
by srean
6/5/2026 at 7:42:20 PM
It's not even applied math
by semiinfinitely
6/5/2026 at 4:28:32 AM
It’s not typeset in math mode so you can’t expect the hyphen to correspond to minus.
by canjobear
6/5/2026 at 9:29:10 AM
By this logic a lot of applied maths papers become “does not compile” :D
by conformist
6/5/2026 at 1:05:12 PM
Cannot tell whether this is sarcasm or not.
by Sharlin