5/22/2025 at 7:44:20 AM
I have no idea how it actually works (at Google), but I wouldn't be surprised if it were just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created that Frankenstein purely by post-training. The big wow moment is that it sort of implies most of the useful knowledge is in the FFN, and attention itself is not that unique/important.
https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...
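Roughly, a forward-only WKV-style recurrence looks like this (my own simplification in PyTorch, not RWKV's actual kernel; k and v are per-token key/value projections, w is a per-channel decay and u a bonus weight for the current token):

    import torch

    def wkv_like(k, v, w, u):
        # k, v: (T, C) key/value projections; w: (C,) positive decay; u: (C,) current-token bonus.
        # Each step mixes a decayed running sum of past values with the current token --
        # no quadratic attention matrix, just a forward recurrence over time.
        T, C = k.shape
        num = torch.zeros(C)            # running weighted sum of values
        den = torch.zeros(C)            # running sum of weights
        out = torch.empty(T, C)
        for t in range(T):
            cur = torch.exp(u + k[t])
            out[t] = (num + cur * v[t]) / (den + cur + 1e-8)
            num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]
            den = torch.exp(-w) * den + torch.exp(k[t])
        return out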
BTW: It could also be interesting to reuse already-trained attention weights and see how long the FFN alone takes in the GPT-2 speedrun (it would be against the rules, but still very interesting IMHO - definitely something I'd like to read a paper about) https://github.com/KellerJordan/modded-nanogpt
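Something like this is what I have in mind: freeze every attention parameter and let only the rest train. Hypothetical sketch only - the ".attn." naming follows nanoGPT-style block names, and `model` is assumed to already hold the pretrained attention weights:

    import torch

    # Freeze attention, keep FFN / norms / embeddings trainable.
    for name, param in model.named_parameters():
        param.requires_grad = ".attn." not in name

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=3e-4)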
Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar and a simple converter can be trained. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
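If the embeddings really are that similar, the converter can be almost trivial - e.g. a least-squares linear map (sketch; emb_a/emb_b are assumed to be embeddings of the same inputs from two different models):

    import torch

    # emb_a: (N, Da), emb_b: (N, Db) -- embeddings of the same texts from model A and model B.
    W = torch.linalg.lstsq(emb_a, emb_b).solution      # linear converter A -> B
    approx_b = emb_a @ W
    print(torch.nn.functional.cosine_similarity(approx_b, emb_b, dim=-1).mean())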
by cztomsik
5/22/2025 at 9:35:24 AM
Ever notice that attention is (with the highest respect to the original researchers) "just" feeding the entire past of the network into a reverse-MoE neural network? (Meaning the "expert" selects parts of the input instead of parts of the network to execute.) In a way, everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or simply couldn't run it enough to train it to a reasonable extent).
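Concretely: every position computes a softmax over its whole history and takes a weighted mix of those inputs, so the "routing" is over the input rather than over sub-networks. Minimal single-head sketch, assuming learned projection matrices Wq/Wk/Wv:

    import torch
    import torch.nn.functional as F

    def causal_attention(x, Wq, Wk, Wv):
        # x: (T, C). Each position "selects" from the entire past via softmax weights.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = (q @ k.T) / k.shape[-1] ** 0.5
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))     # no peeking at future tokens
        return F.softmax(scores, dim=-1) @ v                 # weighted mix of past inputs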
by spwa4
5/22/2025 at 12:20:59 PM
Attention is just a fairly arbitrary way to split the network so that learning can be parallelized. What contributed more towards its success, in my opinion, are the "shortcut connections" through layers, which allow more influence on the early layers during learning.
by scotty79
5/22/2025 at 12:55:23 PM
> What contributed more towards its success, in my opinion, are the "shortcut connections" through layers, which allow more influence on the early layers during learning.
For those who don't know, that is the idea behind ResNet (He et al., "Deep Residual Learning for Image Recognition", https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.
Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
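In code the idea is just one addition: the block learns a correction on top of the identity, so gradients always have a direct path back to earlier layers. Minimal sketch:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim, hidden):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, x):
            return x + self.f(x)    # the shortcut connection: identity plus a learned residual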
by grumbelbart2
5/22/2025 at 12:50:21 PM
> Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar and a simple converter can be trained
That was from here: https://news.ycombinator.com/item?id=44054425
by cubefox
5/22/2025 at 6:39:23 PM
So is the famous "Attention is all you need" wrong?
by jonahx
5/22/2025 at 9:16:33 AM
The relative unimportance of the exact SDPA attention used in modern transformers is already known: https://arxiv.org/abs/2111.11418
The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
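For example, the token mixer in that paper's PoolFormer is just average pooling over neighbouring tokens. Roughly (a sketch adapted to 1D sequences, not the paper's 2D version):

    import torch.nn as nn

    class PoolMixer(nn.Module):
        # Drop-in replacement for attention: mixes information between tokens
        # with simple average pooling instead of softmax attention.
        def __init__(self, pool_size=3):
            super().__init__()
            self.pool = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2,
                                     count_include_pad=False)

        def forward(self, x):                       # x: (B, T, C)
            mixed = self.pool(x.transpose(1, 2)).transpose(1, 2)
            return mixed - x                        # minus x, since the outer residual adds it back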
by slickytail
5/22/2025 at 9:28:18 PM
Hm, residual is the one I would not expect; can you elaborate why?
by cztomsik
5/23/2025 at 12:17:17 AM
It avoids vanishing gradients in deeper networks. Also, most blocks with a residual connection approximate the identity function when initialised, so they tend to be well behaved.
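One common way to see that: if the last layer of the residual branch is zero-initialised (as many implementations do), the whole block is exactly the identity at the start of training. Sketch:

    import torch
    import torch.nn as nn

    branch = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    nn.init.zeros_(branch[-1].weight)
    nn.init.zeros_(branch[-1].bias)

    x = torch.randn(8, 64)
    assert torch.allclose(x + branch(x), x)   # the residual block starts out as the identity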
by simsla