1/18/2026 at 4:35:47 PM
If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, then one of the best ways to do that would probably be to retrain an already existing model, just with swapped attention modules. Then once you have such a model you can do apples-to-apples benchmarks. This has been done successfully in the past:
https://huggingface.co/featherless-ai/QRWKV-72B
Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
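To make the idea concrete, here is a minimal numpy sketch of one well-known family of attention alternatives (kernelized "linear" attention in the style of Katharopoulos et al.) next to standard softmax attention. This is purely illustrative background, not the mechanism from the QRWKV conversion or the linked paper; the feature map `phi` is an arbitrary choice for the sketch. The point is that a drop-in replacement must only match the input/output contract, which is why module swapping plus retraining gives apples-to-apples comparisons:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(T^2) memory: materializes the full T x T score matrix
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def linear_attention(Q, K, V):
    # Kernelized variant: O(T * d^2), never builds the T x T matrix.
    # phi is an arbitrary positive feature map chosen for this sketch.
    phi = lambda x: np.maximum(x, 0) + 1e-6
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d) summary of keys/values
    Z = Qf @ Kf.sum(axis=0)        # (T,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

# Same input/output contract: (T, d) in, (T, d) out
T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
out_std = standard_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
```

Because both functions map `(T, d)` to `(T, d)`, either can sit behind the same module interface; only the retraining (or conversion, as in QRWKV) decides whether the cheaper one reaches comparable quality.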
by kouteiheika
1/18/2026 at 8:23:41 PM
I'd say try the nanogpt speedrun. It's much easier to train, and gives you a better comparison vs optimized systems.
by Herring
1/19/2026 at 2:13:06 AM
The linked paper tested nanoGPT with this new transformer: https://www.techrxiv.org/users/685780/articles/1375955-topol...
by naasking
1/19/2026 at 6:14:23 AM
thanks for linking. Yes, the paper compares the new architecture (which is also a fork of my implementation of nanoGPT) with Karpathy's nanoGPT. There are also links to the code and benchmarks used.
by tuned
1/19/2026 at 6:07:14 PM
Note I didn't say Karpathy's nanoGPT, I said use the speedrun. Transformers are universal function approximators. When well-tuned, they often start to approximate other innovations. Not always, thank god, but often enough that you have to be careful.
by Herring
1/22/2026 at 9:28:14 AM
ok, thanks. I am taking it slow then
by tuned
1/19/2026 at 1:27:34 AM
Labs were also competing to train BERTs for $20 or less. People still use them a lot, too. https://www.databricks.com/blog/mosaicbert
I'll add that they should do a number of small training runs with different architectures and data mixes. That proves generalization.
by nickpsecurity
1/18/2026 at 4:57:05 PM
Depending on how different the attention mechanism is, that might not work. If it's just a faster / different way of finding the tokens to attend to, sure. But I get the sense the author is implying this method uses different semantics somehow. Although tbh I didn't follow it entirely.
by oofbey
1/18/2026 at 5:23:15 PM
This is interesting. Has there been more research into this architecture? I hear about it once every few years but it always seems like a niche / experimental thing. But based on the graph in their blog post you'd expect every company to be using this.by andai
1/19/2026 at 6:18:28 AM
This is a novel re-interpretation of the Transformer, based on my previous research with a library called `arrowspace`. It is something like what is called a "Grassmann-like flow" but without the Plücker embedding, and also similar to what is done in DavisTensor, but relying on a spectral Laplacian instead of purely geometric distances.
The problem with a lot of the work done before is that it focuses on dense representations. This architecture focuses on sparse representations and provides a new approximation computation based on energy-informed graphs.
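For readers unfamiliar with the terminology: the following is a generic spectral-graph-theory sketch of what a graph Laplacian and its "energy" quantities look like. It is textbook background, not the `arrowspace` or topological-transformer implementation; the k-NN graph construction and the 2-neighbour choice are arbitrary assumptions for the example:

```python
import numpy as np

# Toy k-NN graph over a few points, then its unnormalized
# Laplacian L = D - A. Spectral quantities of L are one standard way
# to attach "energy" information to a graph, as opposed to using
# raw geometric distances directly.
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 3))                      # 6 points in R^3

# Adjacency: connect each point to its 2 nearest neighbours
dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
A = np.zeros_like(dists)
for i in range(len(X)):
    nearest = np.argsort(dists[i])[1:3]          # skip self at index 0
    A[i, nearest] = 1.0
A = np.maximum(A, A.T)                           # symmetrize

L = np.diag(A.sum(axis=1)) - A                   # unnormalized Laplacian
eigvals = np.linalg.eigvalsh(L)                  # smallest is ~0

# The quadratic form x^T L x equals the sum over edges of (x_i - x_j)^2,
# i.e. a smoothness "energy" of a signal x living on the graph nodes.
x = rng.normal(size=len(X))
energy = x @ L @ x
```

The smallest eigenvalue of a graph Laplacian is always zero (the constant signal has zero energy), and the quadratic form is non-negative, which is what makes Laplacian spectra usable as energy measures on graphs.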
by tuned
1/19/2026 at 6:12:02 AM
thanks for reading. I cannot retrain an existing model as the self-attention mechanism has been completely redesigned. The Keys and Values in self-attention are stored as scalars, so a latent space with traditional weights does not make sense in the context of a topological transformer. The two latent spaces would be somehow equivalent eventually, but they would store totally different values.
by tuned
1/18/2026 at 10:01:06 PM
That doesn’t tell you if the new method continues to perform better at higher parameter counts.by throwaway314155
1/19/2026 at 6:39:16 AM
it most likely will in terms of performance, as it uses 50% less memory (certainly at inference time, which is the most common operation on web services), because it can leverage longer T and D, provided the design is confirmed and the quality of generation is comparable to other models. If this very basic assumption is correct, it means a lot of savings in electricity, as the same GPUs can resolve more requests.
by tuned
1/20/2026 at 12:01:59 AM
By performance, I meant the accuracy of the model, not the runtime/memory characteristics.by throwaway314155
1/18/2026 at 10:51:11 PM
Nor that the training from scratch will even work.by amelius
1/19/2026 at 6:28:20 AM
exactly, that is the current objective: to prove that generation for a specific domain is on par with causal attention models
by tuned