alt.hn

5/5/2026 at 11:31:12 AM

A polynomial autoencoder beats PCA on transformer embeddings

https://ivanpleshkov.dev/blog/polynomial-autoencoder/

by timvisee

5/8/2026 at 9:32:09 AM

This sounds like projecting the data into the linear space spanned by {x_i, x_i*x_j}, where x_i are the feature variables, and then applying standard regularization methods to remove noise and low-value coefficients.
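A minimal numpy sketch of that reading — an explicit quadratic lift followed by a closed-form Ridge solve. All names here are illustrative, not taken from the post:

```python
import numpy as np

def quadratic_lift(X):
    """Append all pairwise products x_i * x_j (i <= j) to the raw features."""
    n, d = X.shape
    iu, ju = np.triu_indices(d)
    return np.hstack([X, X[:, iu] * X[:, ju]])

def ridge_fit(Phi, Y, lam=1e-3):
    """Closed-form L2-regularized least squares: W = (Phi'Phi + lam*I)^-1 Phi'Y."""
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Y)

# Tiny demo: recover a pure cross-term target that a linear fit cannot express.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = X[:, :1] * X[:, 1:2]                     # target is exactly x_0 * x_1
W = ridge_fit(quadratic_lift(X), Y)
err = np.linalg.norm(quadratic_lift(X) @ W - Y) / np.linalg.norm(Y)
```

Since x_0*x_1 is literally one of the lifted columns, the Ridge solve drives the relative error to near zero; regularization then prunes the remaining coefficients toward zero, which is the "remove low-value coefficients" part.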

Anisotropy and the cone ideas may explain why PCA underperforms, but it does not uniquely justify this particular quadratic decoder. The geometric story is not doing explanatory work beyond “data is nonlinear,” and the real substance is simply that second-order reconstruction empirically helps.

by folderquestion

5/8/2026 at 3:16:09 PM

Author here. Fair characterization, and a fair critique on the geometric story. A few clarifications. I don't claim {x_i, x_i·x_j} is the right lift specifically — the post itself shows datasets where the quadratic decoder gives essentially no improvement over PCA. The contribution is empirical: "second-order is the simplest nonlinear decoder you can fit in closed form, and on anisotropic embeddings it picks up real signal that linear decoders miss." Whether degree 3 would help further is open. Degree 3 blows up fast: at d=100 that's 175K features, and the Ridge solve at that scale starts memorizing the corpus rather than generalizing (§7 in the post discusses this trap at d=256 already). So degree 2 is partly a choice, partly a practical ceiling for the closed-form route.

by pleshkov

5/9/2026 at 1:00:57 AM

Is this similar to Volterra series in signal processing?

by BobbyTables2

5/8/2026 at 3:22:46 PM

It sounds like this replaces the PCA reconstruction function with a quadratic.

The normal PCA encoding:

1) Given a mean-center-scaled X matrix, get the latent variable matrix T with X = T * P’ + e, where P = loadings and e = residuals. The P is your model, so for a new vector xnew, you can calculate tnew = xnew * P (because P’ * P = I).

This is the encoder; nothing changes here. The original matrix is dimensionally reduced with the residuals e discarded. This is why PCA is lossy.

The decoder is where things diverge.

The usual PCA decoder reconstructs a given latent variable t_any using the trained P loadings: x_reconstructed = t_any * P’. This reconstructed data lies on a linear hyperplane, so if the original data did not lie on the hyperplane, reconstruction errors are potentially high.

In your proposal, instead of a linear decoder, you train a quadratic decoder (essentially classic ridge regression on quadratic features) on the original X. So for your reconstruction, you have x_reconstructed = poly(t_any).
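A compact numpy sketch of that encoder/decoder split — PCA via SVD for the shared encoder, then the linear decoder versus a quadratic decoder fit by Ridge on the lifted latents (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
X = X - X.mean(axis=0)                               # mean-center

# Encoder: identical in both cases. P = top-d loadings, T = latents.
d = 3
P = np.linalg.svd(X, full_matrices=False)[2][:d].T   # (D, d), orthonormal columns
T = X @ P

# Linear PCA decoder: x_hat = t @ P'
X_lin = T @ P.T

# Quadratic decoder: ridge-regress X on [t, t_i * t_j]
iu, ju = np.triu_indices(d)
Phi = np.hstack([T, T[:, iu] * T[:, ju]])
lam = 1e-3
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ X)
X_quad = Phi @ W

err_lin = np.linalg.norm(X - X_lin)
err_quad = np.linalg.norm(X - X_quad)
```

In-sample, err_quad comes out below err_lin even on this pure-noise data, which is exactly the point below: the quadratic decoder will eat residual whether it is signal or not.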

This achieves lower reconstruction error in-sample (naturally, because quadratic is higher order than linear), but your poly function is trained on a particular corpus. Which means that when you’re in-distribution within that corpus, you’re good but when you’re not, you can be very wrong in biased ways that PCA’s linear reconstruction is not.

So this is not a better technique than PCA in a general sense. It’s a better reconstruction machine when your data is mostly in-sample. It’s a kind of computationally cheap “specialization” to a particular distribution of data, which can be useful if you’re mostly in-distribution but introduces new risks when out-of-distribution.

Whereas PCA just drops the residual and makes modest claims, a quadratic decoder is trying to predict the residual and on out-of-sample data, it can be wrong in biased ways that PCA is not. In other words, it can hallucinate.

But with a large enough training corpus, chances are we’re going to be in-distribution most of the time, so maybe this could generalize well.

by wenc

5/8/2026 at 5:25:17 PM

This article uses every one of Claude's cliches, e.g., "No SGD, no epochs, no hyperparameter search." It's hard to tell if this is real research.

by rgovostes

5/9/2026 at 4:47:05 PM

All the author’s comments here are straight from Claude too :/

by refibrillator

5/5/2026 at 11:32:06 AM

Author here — questions and pushback both welcome.

by pleshkov

5/8/2026 at 1:37:07 PM

Cool work! Something that worries me with PCA is that it's designed to retain variance, but variance might not be the right metric for the semantics we want to retrieve. Ditto UMAP/t-SNE, which retain distances in lower dimensionality... If semantics are mostly encoded as directions on subsets of dimensions, PCA and friends would be too blunt a tool... I wonder if a better approach would be linear probes or other decoders for a wide range of the concepts one wants to retrieve, and then optimizing compression while keeping those retrievals as high as possible... i.e. tune the compressor to the use case, like MP3 or MPEG do.

by brunosan

5/8/2026 at 8:05:01 AM

In the article, you mention this approach requires no search over hyper-parameters, because the method is a closed-form solution with "simple" linear algebra. I agree with this, but do you not think you need to tune the L2-regularization strength? That seems to me like a hyper-parameter you would need to cross-validate (or similar).

by Devilstro

5/8/2026 at 3:33:38 PM

Fair point — lam is technically a hyperparameter. In practice I used lam=1e-3 (the default in the code) across all four models without tuning, and the gap to PCA is robust enough that small variations don't change the conclusion. So more accurately: "one hyperparameter with a benign default" rather than "no hyperparameters" — you're right I overstated.

by pleshkov

5/8/2026 at 3:36:20 PM

Looking at your experiment code, it seems like the retrieval experiments are done with the reconstructed vectors of dimension D rather than the compressed vectors of dimension d, which doesn't have any direct performance improvements. Later on in the post you indicate that the real advantage is that the residuals are more isotropic and therefore you can quantize the pair (p, V_resid) with less quality degradation, but I don't see any experiments actually verifying that retrieval quality holds up in this setting. Also, it's not quite clear to me how you efficiently compute cosine similarity for vectors encoded in this form. Doesn't the V_resid part of the computation require something significantly more complex than a dot product?

by aesthesia

5/9/2026 at 11:06:24 PM

I agree that we don't want to reconstruct the whole vector during retrieval, and that makes poly-AE toy-like in its current state, not production ready. My main interest here is just gaining recall percentage points in closed form, and then thinking about how to make it fast. Across these threads I've gotten good intermediate thoughts on the topic which may help me bring it closer to production form.

by pleshkov

5/8/2026 at 8:06:07 AM

You should benchmark the retrieval speed of each method in terms of queries per second. I suspect that the gain in bandwidth you get from slightly better compression will be defeated by decompression being much more expensive.

by yorwba

5/8/2026 at 1:36:21 PM

I think your per-axis std normalization is likely doing a big pile of the work; it’s fairly well-known that “wrong” PCA, setting sigma=Id or just taking a square root, gives better embeddings than the un-normalized version. It would be worth showing a comparison to similarly-normalized PCA, I think, if it’s not too hard?
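For concreteness, the baseline being asked for is just PCA fit on per-axis standardized data. A sketch with illustrative names, not code from the post:

```python
import numpy as np

def pca_fit(X, d):
    """Return mean and top-d loadings of mean-centered X."""
    mu = X.mean(axis=0)
    V = np.linalg.svd(X - mu, full_matrices=False)[2]
    return mu, V[:d].T

def normalized_pca_fit(X, d, eps=1e-8):
    """Same PCA, but each axis is divided by its std first."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + eps
    _, P = pca_fit((X - mu) / sigma, d)
    return mu, sigma, P

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 16)) * np.linspace(1, 20, 16)  # strongly anisotropic axes
mu, sigma, P = normalized_pca_fit(X, d=4)
Z = (X - mu) / sigma      # standardized data the loadings were fit on
T = Z @ P                 # latents of the normalized baseline
```

The comparison the comment asks for is then: reconstruction/retrieval quality with these P versus the un-normalized loadings, holding d fixed.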

by dbfclark

5/9/2026 at 10:29:13 PM

Just checked the normalization point. You were partially right: sqrt normalization cuts the gap roughly in half. I'm updating the numbers in the post. One interesting detail: I did a smoke test of poly-AE without whitening and the result didn't change. I won't mention it in the post because right now I'm not sure if it's a random effect or whether the polynomial lift really compensates for the normalization.

by pleshkov

5/8/2026 at 3:23:39 PM

Good catch, this is the obvious ablation I should have included. I'll re-run with per-axis normalized PCA as a separate baseline and post numbers in this thread tomorrow. Prior: I expect some of the gap to come from normalization, but not all — the no-improvement results on isotropic datasets (§4) suggest there's structural signal the polynomial cross-terms catch that linear projection structurally can't. But that's a prediction; let me actually run it.

by pleshkov

5/8/2026 at 9:02:34 AM

Cool idea. But it only works when the data never changes. Could you make a streaming/incremental version, one that updates the math cheaply when new data arrives instead of recomputing everything, or does the math fundamentally prevent it?

by afxuh

5/8/2026 at 10:01:45 AM

[dead]

by mpaiello

5/8/2026 at 9:33:52 AM

Really cool! I was investigating PCA on retrieval, thanks for the references.

by stephantul

5/8/2026 at 9:51:04 PM

So this is PCA in kernel space?

by roger_

5/8/2026 at 9:58:34 AM

Geometric Algebra (GA) (Clifford Algebra) also has high potential to transform neural architectures. Models like the Geometric Algebra Transformer (GATr) and Versor (2026) demonstrate that it can enhance the attention mechanism or even make it obsolete.

By representing data as multivectors, translational and rotational symmetries are encoded natively which allows them to handle geometric hierarchies with massive efficiency gains (reports of up to 78x speedups and 200x parameter reductions) compared to standard Transformers.

> A novel sequence architecture is introduced, Versor, which uses Conformal Geometric Algebra (CGA) in place of traditional linear operations to achieve structural generalization and significant performance improvements on a variety of tasks, while offering improved interpretability and efficiency. By embedding states in the manifold and evolving them via geometric transformations (rotors), Versor natively represents -equivariant relationships without requiring explicit structural encoding. Versor is validated on chaotic N-body dynamics, topological reasoning, and standard multimodal benchmarks (CIFAR-10, WikiText-103), consistently outperforming Transformers, Graph Networks, and geometric baselines (GATr, EGNN).

https://arxiv.org/abs/2602.10195

by mentalgear

5/9/2026 at 10:18:59 AM

The polynomial lift in this post originally came out of an unsuccessful experiment with hyperbolic embeddings. The idea was to embed corpora into a hyperbolic ball (anisotropic embeddings have a tree-like structure that hyperbolic space could exploit). The lift was a tool to go from the hyperbolic latent back to Euclidean space for retrieval. The hyperbolic part didn't work; the lift, evaluated standalone, kept showing real signal, and that became this post.

by pleshkov

5/9/2026 at 4:43:58 PM

This was a good write-up but the author's claim that this is relatively unknown in machine learning is not quite accurate.

Back in grad school when we covered basic OLS methods expanding the feature space using a quadratic manifold was a common technique for teaching that regression can support nonlinear features and still remain "linear in the coefficients".

It is also implemented in widely used ML libraries. Scikit-learn, for example, has a KernelPCA object which supports a polynomial kernel (in this instance degree two) that computes the inner product on an explicit feature map containing all second-order monomials.
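That kernel/feature-map equivalence is easy to verify numerically: the degree-2 polynomial kernel (x·y)² is the inner product of the vectors of all second-order monomials, with √2 weights on the cross terms. A self-contained numpy check (illustrative names):

```python
import numpy as np

def phi2(X):
    """Explicit feature map for k(x, y) = (x . y)**2: all monomials x_i * x_j
    with i <= j, weighting the i < j cross terms by sqrt(2) so that
    phi2(x) . phi2(y) reproduces the kernel exactly."""
    n, d = X.shape
    iu, ju = np.triu_indices(d)
    w = np.where(iu == ju, 1.0, np.sqrt(2.0))
    return X[:, iu] * X[:, ju] * w

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
K_kernel = (X @ X.T) ** 2          # degree-2 polynomial kernel, no offset
K_explicit = phi2(X) @ phi2(X).T   # same Gram matrix via the explicit map
```

The two Gram matrices match to floating-point precision, which is why "quadratic lift then linear method" and "kernel method with a quadratic kernel" are two views of the same feature space.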

by gh0stmach1ne

5/8/2026 at 7:19:29 AM

My understanding after scanning the code examples is that the technique expands the dimensionality of each data point with the quadratic products of its existing dimensions. I thought it sounded like kernel PCA.

by yobbo

5/8/2026 at 9:13:57 AM

I think that kernel PCA is a strict superset of PCA. That would make it trivially true that it beats PCA.

by ekjhgkejhgk

5/9/2026 at 8:37:55 AM

What makes this different from just kernel PCA with the quadratic kernel?

by ComplexSystems

5/10/2026 at 11:56:21 AM

It's close but not the same. Kernel PCA lifts all D coordinates, which gives M around 525k at D = 1024. In the post I do PCA first to reduce D to d = 256, then lift only those d coordinates: M = 33k. Much smaller, much faster Ridge solve.
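Those sizes are just counts of degree ≤ 2 monomials, M = d + d(d+1)/2. A quick check:

```python
def lifted_dim(d):
    """Feature count after a degree-2 lift: the d linear terms
    plus the d*(d+1)/2 products x_i * x_j with i <= j."""
    return d + d * (d + 1) // 2

print(lifted_dim(1024))  # lifting all D = 1024 coords: 525824 (~525k)
print(lifted_dim(256))   # lifting d = 256 PCA coords:  33152  (~33k)
```

So reducing to d = 256 first shrinks the Ridge problem by roughly 16x in feature count, and far more in the cost of the normal-equations solve.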

by pleshkov

5/10/2026 at 6:41:27 PM

That makes sense. If you could magically just get the top d PCs in quadratic kernel space without having to compute the whole kernel matrix, and then just do top-d quadratic PCs -> ridge, would that be better than doing the PCA -> top-d -> quadratic kernel ridge as you are now?

by ComplexSystems

5/8/2026 at 6:23:51 AM

I'm just a casual LLM user, but your description of the anisotropy made me think about recent work on KV cache quantization techniques such as TurboQuant, where they apply a random rotation to each vector before quantizing; as I understood it, this is precisely to make the vectors more isotropic.

But for RAG that might be too much work per vector?

by magicalhippo

5/8/2026 at 10:22:27 AM

this looks awesome. i've been struggling with vector compression, and have been trying PCA + all sorts of rotations. looking forward to trying this out

by electroglyph

5/8/2026 at 7:46:40 PM

Nice idea. The UI feels simple and easy to use.

by Abhijeet620380

5/8/2026 at 7:44:58 AM

I came here from a discussion about CS students who can't be bothered to set up email filters. How can they ever expect to be able to digest even the first paragraph of this article?

by teiferer

5/8/2026 at 9:41:47 AM

This kind of attitude isn’t long for this world. The time of this knowledge being locked up in academia is over, and in 15 years we’ll look back on this time as the dark ages as open code and models eclipse it.

None of this stuff is as difficult to understand as people claim it is once you work with it.

by whywhywhywhy

5/8/2026 at 7:53:41 AM

FWIW I found it quite straightforward. But then I did have some linear algebra back at uni.

That said I do think it's a good habit to either write out abbreviations in full or link to say Wikipedia, eg for PCA[1]. It's a well-known tool but still if you come from a slightly different field it might not ring a bell.

[1]: https://en.wikipedia.org/wiki/Principal_component_analysis

by magicalhippo

5/8/2026 at 9:08:46 AM

Definitely not by furthering their email client wrangling skills.

by perching_aix

5/8/2026 at 2:05:35 PM

By taking a class in machine learning. What kind of a CS student can't set up an email filter??

by esafak