alt.hn

1/15/2025 at 12:37:35 AM

Transformer^2: Self-Adaptive LLMs

https://sakana.ai/transformer-squared/

by hardmaru

1/16/2025 at 6:04:45 PM

Does anyone else find that their results don't match their claims? In many cases the base model or a simple LoRA beats their proposed method. The few times theirs wins, the difference is very small. I feel like some of these "wins" are more sampling error than any significant improvement.

I'm always happy to see publishing of negative results, but it seems like they are selling what are negative results as positive results.
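
To put a rough number on the sampling-error concern, here is a minimal sketch of the standard error of a difference between two benchmark accuracies. The accuracies and test-set size are made up for illustration and are not taken from the paper, and the formula treats the two measurements as independent binomial samples, which is a simplification when both models are scored on the same items.

```python
import math

def diff_standard_error(p_a: float, p_b: float, n: int) -> float:
    """Standard error of the difference between two accuracies,
    each measured on n independent test items (binomial approximation)."""
    return math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)

# Hypothetical numbers, not taken from the paper.
baseline, proposed, n_items = 0.75, 0.76, 1000
se = diff_standard_error(baseline, proposed, n_items)
gap = proposed - baseline
significant = abs(gap) > 1.96 * se   # rough 95% threshold
print(f"gap={gap:.3f}, se={se:.3f}, significant={significant}")
```

With these made-up numbers, a one-point gap is only about half a standard error, so it would not be distinguishable from noise.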

by RevEng

1/15/2025 at 4:27:21 AM

This sounds like MoE and maybe a bit of chain-of-thought. Curious what someone with more domain expertise thinks about this

If they can test against Llama 70B and Mistral 7B, they ought to compare against Mixtral 8x7B, imho

by verdverm

1/15/2025 at 1:13:29 PM

I'm not an expert, but MoE models perform better at continuous learning, because they are less prone to catastrophic forgetting.

by imtringued

1/15/2025 at 4:16:26 AM

Great research here. Contextual real-time weight modification is definitely one of the breakthroughs required for AGI. Why create a LoRA when you can generate one on the fly suited to the task at hand?

by wildermuthn

1/15/2025 at 4:30:53 AM

It does not seem like they are doing inference-time weight changes along the lines of running backprop. It sounds more like they apply a pre-trained vector to the model and select that vector based on the input, in a two-step process.
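
A minimal sketch of that two-step reading, assuming the vector rescales the singular values of a weight matrix (as other comments in this thread describe). The names `Z_VECTORS`, `classify_task`, and `adapted_weight` are hypothetical, the vectors are random rather than trained, and the keyword rule is only a stand-in for whatever the real first pass does.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                       # stand-in for one weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)  # decomposed once, offline

# Hypothetical "pre-trained" expert vectors: one scale per singular value.
Z_VECTORS = {
    "math": rng.uniform(0.5, 1.5, size=s.shape),
    "code": rng.uniform(0.5, 1.5, size=s.shape),
    "other": np.ones_like(s),                     # identity: leaves W unchanged
}

def classify_task(prompt: str) -> str:
    """Step 1 (stand-in): a keyword rule instead of a real model pass."""
    text = prompt.lower()
    if any(w in text for w in ("solve", "integral", "sum")):
        return "math"
    if any(w in text for w in ("function", "bug", "compile")):
        return "code"
    return "other"

def adapted_weight(prompt: str) -> np.ndarray:
    """Step 2: rescale the singular values with the selected vector."""
    z = Z_VECTORS[classify_task(prompt)]
    return U @ np.diag(s * z) @ Vt

W_math = adapted_weight("Solve 3x + 2 = 11")
W_same = adapted_weight("Tell me a story")
```

No gradients are computed at inference time here: the only per-input work is picking a vector and rebuilding the matrix.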

by verdverm

1/15/2025 at 4:52:12 AM

That’s my general understanding as well, but it isn’t a large conceptual leap to go from real-time selection of pretrained “z-vectors” to real-time generation of the same. The larger conceptual breakthrough, with demonstration of its effectiveness, is the big success here.

by wildermuthn

1/15/2025 at 1:03:35 PM

While not a large conceptual leap, the real-time generation of "z-vectors" is not cheap in terms of compute or data requirements, the latter of which I see as the main issue. How are you going to generate the vector from a single real-time input?

I still have yet to see anything that dissuades me from agreeing with Yann LeCun when he says Transformers are fundamentally limited. We won't get creativity, reasoning, or even move past hallucinations without a major breakthrough

by verdverm

1/15/2025 at 8:14:55 PM

How do the o3 results fit in context of this perspective?

by mordymoop

1/15/2025 at 8:57:32 PM

They do not change it. From what I have seen, o3 is more hype and marketing than a meaningful step towards models which can exhibit real creativity and reasoning as humans perform it (rather than perceive it, which is the root of the hype)

For example, a small child is completely capable of being told "get in the car" and can understand, navigate, open the door, and get in, with incredibly little energy usage (maybe about the amount of a single potato chip/crisp)

Now consider what I have been working on recently: (1) evaluating secops tools from both a technical and business perspective, and (2) prototyping and creating an RFC for the next version of our DX at the org. These models are very far from this capability because it involves so many competing incentives and trade-offs, and not just the context of the current state of the code, but also the history and vision. Crafting that vision is especially beyond what a foundation in transformers can offer. They are in essence an averaging and sequence prediction algorithm

These tools are useful, even provide an ROI, but by no means anywhere close to what I would call intelligent.

by verdverm

1/15/2025 at 9:16:10 AM

The interesting thing here is that the human brain also seems to use pretrained ... things. For vision, use the visual subsystem. For hearing, use the auditory subsystem. For movement ... you get the point. Plus you can combine these pretrained ... things, so for example for complex movement, like balancing on a tightrope, multiple subsystems are used (try standing on one leg with your eyes closed).

Z-vectors are of course nothing like the subsystems in your brain, but the general approach is certainly similar to how the brain works.

by mtts

1/15/2025 at 9:25:23 AM

> things

Senses?

by dleeftink

1/15/2025 at 10:38:59 AM

For sight and hearing, yes, but is "language use" a sense?

by mtts

1/15/2025 at 1:06:29 PM

In the strict sense, no, but as a system of communication, yes; organisms need some form of sensory perception to communicate or 'sense' language.

by dleeftink

1/15/2025 at 9:12:40 AM

Sort of. According to the text they can use multiple z-vectors (sets of weights that select for parts of the system to be used to answer a specific question) simultaneously, using a "simple optimization algorithm" to determine the relative weight for each of these vectors.
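
A minimal sketch of how such relative weights could be searched for, assuming the "simple optimization algorithm" is something like the cross-entropy method. That choice, the toy score function, and every name below are assumptions for illustration; in a real setup the score would be performance on a handful of few-shot examples.

```python
import numpy as np

def cem_mix(score, k, iters=30, pop=64, elite=8, seed=0):
    """Cross-entropy method over k mixing coefficients (maximizes score)."""
    rng = np.random.default_rng(seed)
    mean, std = np.full(k, 1.0 / k), np.ones(k)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, k))
        scores = np.array([score(a) for a in samples])
        elites = samples[np.argsort(scores)[-elite:]]   # top-scoring samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy setup (hypothetical): three expert z-vectors of length 16.
rng = np.random.default_rng(1)
z_experts = rng.normal(size=(3, 16))
true_alpha = np.array([0.7, 0.2, 0.1])
z_target = true_alpha @ z_experts

def score(alpha):
    # Stand-in objective: negative squared distance to a known target mix.
    return -np.sum((alpha @ z_experts - z_target) ** 2)

alpha = cem_mix(score, k=3)
z_mixed = alpha @ z_experts   # the combined vector that would be applied
```

The search only touches a few coefficients, one per expert vector, which is why it can stay cheap compared with fine-tuning.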

by mtts

1/15/2025 at 7:54:53 AM

>Contextual real-time weight modification is definitely one of the breakthroughs required for AGI.

It's already been invented: https://arxiv.org/abs/2202.05780 . That design is just very inefficient to scale up / use as a transformer backbone.

by logicchains

1/15/2025 at 10:04:16 AM

Why not, as each new task comes up, and then weights are revalued, save those weights and keep them for reference as priors for similar future tasks? As the model is exposed to new data the average of the set of priors of things the model thinks is similar might move closer to the posterior making the model quicker and more able to arrive at good outcomes. I suppose storage might be an issue.

by mnky9800n

1/15/2025 at 11:41:51 AM

I'm wondering if you could fine tune the model on an aggregate of a temporal slice of revalued weights? Something analogous to REM sleep's involvement in embedding the days events into long term memory.

by magospietato

1/15/2025 at 7:35:04 PM

Sieve the temporary backprop interim weights as a function of their loss of varentropy relative to their place in the revalued weights.

Remove the bottom weights dynamically based on the local gradient in varentropy so that internal dissonance ("doubt") can be selected against.

"Preference Optimization" but with more opportunities for meta-optimization.

by Jerrrry

1/15/2025 at 2:06:50 PM

thats just mixture of experts

by QuadmasterXLII

1/15/2025 at 2:36:29 PM

i thought mixture of experts didn't update itself with new sets of weights and was just a collection of already trained networks/weights? I could be wrong.

by mnky9800n

1/15/2025 at 2:54:09 PM

Well, that depends on whether you keep training it

by QuadmasterXLII

1/15/2025 at 2:57:05 PM

perhaps they should always be training and never static. haha. i allegedly grow wiser in my age, why not neural networks?

by mnky9800n

1/15/2025 at 3:13:28 PM

One weakness of this method is the storage of the decomposed U and V from W. My linear algebra is rusty, but storing them seems to be required if you want to scale in the subspace projected by U, which doubles your weight memory footprint (that said, U and V should be easier to quantize from an information-theory perspective). I also think MoE is more principled if you want expert activations. But I understand that Sakana's research focus is mostly on adapting existing pretrained models, not training from scratch.
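
A quick back-of-the-envelope check of the "double the footprint" estimate, assuming the factors of a reduced SVD are kept in place of the dense matrix. The 4096 x 4096 layer size is hypothetical, and whether an implementation actually stores the factors this way is an assumption.

```python
import numpy as np

def storage_counts(m: int, n: int) -> tuple[int, int]:
    """Element counts for a dense m x n matrix vs. its reduced SVD factors."""
    r = min(m, n)
    dense = m * n
    factored = m * r + r + r * n   # U (m x r), singular values (r), V^T (r x n)
    return dense, factored

# Check the formula against real factor shapes on a small random matrix.
W = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
assert U.size + s.size + Vt.size == storage_counts(6, 4)[1]

# Hypothetical square layer size, chosen only for illustration.
dense, factored = storage_counts(4096, 4096)
print(factored / dense)
```

For a square matrix the ratio comes out just above 2, which matches the estimate; for a rectangular matrix it is lower, since only the factor on the shorter side is square.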

by liuliu

1/15/2025 at 10:17:12 AM

> Transformer² represents a significant milestone in the evolution of AI systems.

Coming from a math background, it always amazes me to see how people in AI/ML brag about their papers. If someone wrote:

> My paper represents a significant milestone in the evolution of algebraic geometry/ergodic theory/combinatorics

it would be a laughing stock for the math community.

by E_Bfx

1/15/2025 at 1:09:47 PM

They aren't just researchers, there is a company that took on $200M in a Series A...

https://sakana.ai/series-a/

by verdverm

1/15/2025 at 1:48:54 PM

Why is this relevant when presenting scientific research? Or is the point of your comment to say, they are incentivized to "brand" their research in a way which is attractive to a VC audience?

by mccoyb

1/15/2025 at 1:52:47 PM

It's offered as one possible explanation for the tone or style of the language that GP commented on. I don't think their observation applies to ML research at large; this group seems to be more eccentric in their writing (see their history of submissions on HN and their blog more generally)

by verdverm

1/15/2025 at 8:36:10 PM

> Why is this relevant when presenting scientific research?

I’m guessing that the difference lies in the potential value extraction possibilities from the idea.

If you compare the transformers paper to an algorithm or a geometry result that nobody uses, I think the differences are obvious from this perspective.

However, if that paper on geometry led to something like a new way of doing strained silicon for integrated circuit design that made manufacturing 10 times cheaper and the circuit 10 times faster, then that would be more important than the transformers one.

by sroussey

1/15/2025 at 1:53:41 PM

> Or is the point of your comment to say, they are incentivized to "brand" their research in a way which is attractive to a VC audience?

Yes

by KolmogorovComp

1/15/2025 at 2:08:51 PM

Anyone can be a researcher/scientist if they pass peer review at a reputable journal or conference. That's just how it is.

by Der_Einzige

1/15/2025 at 5:41:36 PM

The bar seems to be much lower than getting a peer reviewed paper published at a reputable outlet

This particular paper is not peer reviewed or published beyond a preprint on arxiv

by verdverm

1/15/2025 at 10:34:04 AM

In ML, results are often a score (accuracy or whatever), which makes it more gamified

It's common to have competitions where the one with the highest score in the benchmark "wins". Even if there is no formal competition, it's very important being the SOTA model.

Results are more applicable to the real world, and more "cool" subjectively (I don't think there's a Two Minute Papers equivalent for math?), which increases ego.

And often authors are trying to convince others to use their findings. So it's partly a marketing brochure.

by redox99

1/15/2025 at 12:37:12 PM

- There is also (but on a smaller scale) a gamification of math with bounties (https://mathoverflow.net/questions/66084/open-problems-with-...) but when a result is proved you cannot prove it "better than the first time". So it is more a "winner take it all" situation.
- I am not sure but the "2-minute papers" equivalent would be poster sessions, a must-do for every Ph.D. student
- For the marketing side, there are some trends in math, and subtly researchers try to brand their results so they become active research fields. But since it cannot be measured with GitHub stars or Hugging Face downloads, it is more discreet

by E_Bfx

1/15/2025 at 11:11:01 AM

Especially when the results are so modest! "Significant" doesn't seem like unfalsifiable hype here, it's just wrong.

by aithrowawaycomm

1/15/2025 at 1:11:12 PM

Yeah the naming implies a significant breakthrough, but this is just an incremental stepping stone that will be forgotten in time.

by imtringued

1/17/2025 at 2:17:38 PM

Obvious next step: use this kind of model in AI-Scientist, Sakana AI's AI-powered automated researcher project.

by anticensor

1/15/2025 at 2:16:26 PM

Can someone please enlighten me how this is any different from Mixture of Experts? Because I don't see any difference at all.

by ghc

1/15/2025 at 3:17:29 PM

The router is manually designed (see their cem function). Also, the experts are not separate weights, just different scales of its singular values.
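
Written out, the contrast looks roughly like this. The notation is my own shorthand and an assumption, not copied from the paper: z is one expert's scaling vector, sigma holds the singular values of W, and g and E_i are the gate and expert networks of a generic MoE layer.

```latex
% Singular-value scaling as described above: one expert = one vector z
W = U \,\mathrm{diag}(\sigma)\, V^\top, \qquad
W'_z = U \,\mathrm{diag}(\sigma \odot z)\, V^\top

% Generic MoE layer, for contrast: separate expert networks E_i and a learned gate g
y = \sum_{i=1}^{K} g_i(x)\, E_i(x)
```

In the first form every expert shares U and V and differs only in z; in the second each expert carries its own weights.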

by liuliu

1/15/2025 at 10:41:25 PM

Thank you, I was missing that second part.

by ghc

1/15/2025 at 9:39:37 AM

It is discomforting to read, in the first paragraph, that "dynamical adjustment of weights" is justified as "adaptation". Clearly it is a sought milestone to have «a future where AI models are no longer static»: but the chief reason remains, "intelligent systems reprocess their body of knowledge and change it to improve it" - it is anterior to "adaptation to environment", it is "maintenance of the body of knowledge (of the world model)": it is the continuous practice of "thinking about things", "pondering", "reflecting", "using judgement"...

There is not just a simple «lifelong learning»: the whole past experience is still productive, requiring analysis, not "solved".

Anyway: the directions seem good.

Edit: equally interesting in another direction is the automated analysis of the internal subagents, «break[ing] down the vast, complex knowledge stored in the LLM into smaller, meaningful, and independent pieces (e.g., the different pathways or components for math, language understanding, etc)». Should there not be a general study of the dissection of systems with seemingly emergent intelligence, doing to LLMs what we do to C. elegans?

by mdp2021

1/15/2025 at 7:06:57 PM

Worth noting is that the original inventor of the transformer is part of this team

by qoez

1/15/2025 at 4:14:47 PM

> https://sakana.ai/

I like that background animation. Seems like there's an opportunity for tiny logic gates and some punny swarm behavior.

by qrsjutsu

1/15/2025 at 1:04:00 PM

Is this real? Or is this a hustler type paper/company.

by justanotherjoe

1/15/2025 at 2:58:15 PM

The paper's infographics seem more PR than scientific

by SubiculumCode

1/15/2025 at 6:59:58 AM

It's all very interesting but those pictures look pretty bad. Clear visible artifacts, awful shapes.

by Vampiero

1/15/2025 at 8:00:59 AM

The ideas in the paper have been implemented and tested. The authors conducted experiments on several tasks (math, coding, reasoning, and visual question answering) and showed that their approach works better than previous methods like LoRA.

Key ideas (in simple terms):

1. What’s the problem?

    - Fine-tuning LLMs for every new task is slow, expensive, and often doesn't generalize well.
    - Models trained on one task may perform poorly on others, especially unseen ones.
    - Current methods (like LoRA) can add new capabilities but aren't efficient enough.

2. The solution:

    - Transformer² uses a new fine-tuning method called Singular Value Fine-tuning (SVF). This focuses on adjusting only certain parts of the model’s "weight matrices" rather than changing everything.
    - By tweaking specific components (called "singular values"), it trains smaller, efficient "expert" modules that specialize in particular types of tasks.

3. How it works:

    - Training phase: Train these smaller expert modules offline using reinforcement learning (RL) to specialize in tasks like coding, math, or reasoning.
    - Inference phase: When a new input is given, the system analyzes the task (e.g., “Is this a math or coding problem?”) in the first pass. Based on this, it combines the right expert modules and adapts the model’s behavior in the second pass.

4. Three adaptation strategies:

    - Prompt-based: Use a cleverly designed text prompt to figure out the task type and pick the right expert module.
    - Classifier-based: Train a separate model to classify tasks and match them to experts.
    - Few-shot adaptation: Look at a small number of examples (few-shot learning) to dynamically combine expert modules for the best results.

5. Efficiency:

    - The system uses fewer parameters than traditional fine-tuning methods like LoRA.
    - Adaptation works even on small datasets without overfitting or forgetting older tasks.
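
The "fewer parameters" point can be made concrete with a rough per-matrix count, assuming one learned scale per singular value for SVF and a standard low-rank update for LoRA. The layer size and rank below are hypothetical, and the function names are mine.

```python
def lora_params(m: int, n: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA update B @ A on an m x n matrix."""
    return rank * (m + n)   # B is m x rank, A is rank x n

def svf_params(m: int, n: int) -> int:
    """One learned scale per singular value of an m x n matrix."""
    return min(m, n)

# Hypothetical layer size and LoRA rank, chosen only for illustration.
m, n, rank = 4096, 4096, 16
print(lora_params(m, n, rank), svf_params(m, n))
```

Under these assumptions the per-matrix count differs by a factor of 32. This says nothing about accuracy, only about how many numbers are trained.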

by tzury