4/1/2026 at 5:35:35 AM
Not sure if I buy it. First, SVD decomposition to obtain U, Σ, V is computationally expensive, so it would work only if we are not finetuning very big models. But my real concern is with the results. The "13 parameters" figure looks like bait, because it is a single result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated on every model. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of the Qwen model's training set, and this tinylora finetuning made the last adjustments to perfectly reflect that overtraining.
by dollo_7
4/1/2026 at 6:01:26 AM
Fair points, especially on GSM8K saturation and Qwen possibly already sitting close to the solution. That said, even if this is mostly "last-mile alignment", the fact that it can be done with such a tiny signal is still interesting: it suggests the gap between capability and behavior might be much smaller (and cheaper to bridge) than we assume.
by sachaa
4/1/2026 at 3:06:42 PM
> the gap between capability and behavior might be much smaller

Can you elaborate a bit on what you mean by the gap?
by endofreach
4/2/2026 at 3:51:34 AM
[dead]
by romerocruzsa
4/1/2026 at 3:16:16 PM
I've done a lot of exploratory work with Stable Diffusion LoRAs, and I actually do buy that there's some juice here, though it's almost certainly not as good as other techniques can be. In particular, this technique will likely avoid the intruder-dimension problem which plagues naive LoRA. SVD is expensive, but you only have to do it once at the beginning of training.
I haven't done much research lately, but when I was working on this, I was having substantial success training an adapter of the form U_k @ P @ A, where U_k was the top-k left singular vectors of the underlying weight, and P and A were your typical LoRA projection matrices.
The 13 parameters are kind of misleading here; the real juice is going to be in the P_i fixed random matrices. My suspicion is that they are overfitting to the benchmark, but they are almost certainly observing a real gain in model capacity that is largely due to avoiding the intruder-dimension problem.
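The adapter shape described above can be sketched roughly like this; a minimal numpy sketch, where the dimensions, rank choices, and init scales are all illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen base weight of a linear layer (d_out x d_in).
d_out, d_in, k, r = 64, 32, 8, 4
W = rng.standard_normal((d_out, d_in))

# One-time SVD at the start of training: keep the top-k left singular
# vectors of the frozen weight. This is the only expensive step.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k]  # (d_out, k), frozen during finetuning

# Trainable LoRA-style factors, routed through the top-k subspace so the
# update stays aligned with the weight's dominant singular directions.
P = rng.standard_normal((k, r)) * 0.01    # (k, r)
A = rng.standard_normal((r, d_in)) * 0.01  # (r, d_in)

# Adapted forward pass: y = (W + U_k @ P @ A) @ x
x = rng.standard_normal(d_in)
y = (W + U_k @ P @ A) @ x
```

The point of the construction is that the low-rank update can only move the weight within the span of its own leading singular vectors, which is one way to avoid introducing "intruder" directions.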
by cheald
4/1/2026 at 12:41:31 PM
They're using the truncated SVD, not the full variant, which is computationally cheaper.
by sorenjan
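The difference is easy to see in code. A small sketch comparing the two (the matrix size and k are arbitrary; scipy's `svds` is used here as one common truncated-SVD implementation, not necessarily what the paper uses):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 256))

# Full SVD: computes all min(m, n) = 256 singular triplets.
U_full, S_full, Vt_full = np.linalg.svd(W, full_matrices=False)

# Truncated SVD: an iterative solver that computes only the top-k
# triplets, much cheaper when k << min(m, n).
k = 8
U_k, S_k, Vt_k = svds(W, k=k)
S_k = S_k[::-1]  # svds returns singular values in ascending order

# The top-k values match the leading values of the full decomposition.
print(np.allclose(S_k, S_full[:k]))
```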
4/1/2026 at 5:57:30 AM
Yeah, my big problem with the paper is that it might just be an artifact of Qwen's training process.
by robrenaud
4/1/2026 at 10:39:27 AM
In all fairness, most of the unique stuff I can do is probably an artifact of my training process, so it seems unfair to deny an LLM the same accommodation.
by taneq
4/1/2026 at 1:25:42 PM
How much did your training cost society?
by nativeit
4/1/2026 at 4:24:06 PM
This got me thinking, and it might actually be a comparable amount. Let's estimate that 12 years of schooling runs at minimum $100,000 per student, at least in the US [1]. Add onto that whatever else you may do afterwards, i.e. a bunch more money for paid (college) or "unpaid" (self-taught skills and improvements) education, and then the likely biggest, yet hard-to-quantify, portion for white-collar workers: the experience and "value" that professional work equips one with.
Now divide the average SOTA LLM's training cost (or a guess, since these numbers aren't always published as far as I'm aware) by the number of users, or, if you wanted to be stricter, by the number of people it's proven to be useful for (what else would training be for), and it might not be so far off anymore.
Of course, whether it makes sense to divide and spread out the LLMs' costs across users in order to calculate an "average utility" is debatable.
[1] https://www.publicschoolreview.com/average-spending-student-...
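The division above, written out; every number here is a loudly hypothetical placeholder (none of these figures are sourced), and the result swings by orders of magnitude depending on which denominator you pick:

```python
# Back-of-envelope amortization. All inputs are illustrative assumptions.
human_education_cost = 100_000       # ~12 years of US K-12 per student [1]
llm_training_cost = 1_000_000_000    # guessed cost of one SOTA training run
all_users = 100_000_000              # guessed total user count
useful_for = 100_000                 # stricter: people it's proven useful for

per_user = llm_training_cost / all_users
per_useful_user = llm_training_cost / useful_for

print(per_user)         # 10.0 USD per user under these assumptions
print(per_useful_user)  # 10000.0 USD with the stricter denominator
```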
by msdz