5/21/2025 at 6:38:51 PM
Hi HN, I'm Jack, the last author of this paper. It feels good to release this, the fruit of a two-year quest to "align" two vector spaces without any paired data. It's fun to look back a bit and note that at least two people told me this wasn't possible:
1. An MIT professor who works on similar geometry alignment problems didn't want to work on this with me because he was certain we would need at least a little bit of paired data
2. A vector database startup founder who told me about his plan to randomly rotate embeddings to guarantee user security (and ignored me when I said it might not be a good idea)
The practical takeaway is something that many people already understood, which is that embeddings are not encrypted, even if you don't have access to the model that produced them.
As one example, in the Cursor security policy (https://www.cursor.com/security#codebase-indexing) they state:
> Embedding reversal: academic work has shown that reversing embeddings is possible in some cases. Current attacks rely on having access to the model [...]
This is no longer the case. Since all embedding models are learning ~the same thing, we can decode any embedding vectors, given we have at least a few thousand of them.
by jxmorris12
5/21/2025 at 8:30:38 PM
I hate to be "reviewer 2", but:I used to work on what your paper calls "unsupervised transport", that is machine translation between two languages without alignment data. You note that this field has existed since ~2016 and you provide a number of references, but you only dedicate ~4 lines of text to this branch of research. There's no comparison about why your technique is different to this prior work or why the prior algorithms can't be applied to the output of modern LLMs.
Naively, I would expect off-the-shelf embedding alignment algorithms (like <https://github.com/artetxem/vecmap> and <https://github.com/facebookresearch/fastText/tree/main/align...>, neither of which are cited or compared against) to work quite well on this problem. So I'm curious if they don't or why they don't.
I can imagine there is lots of room for improvements around implicit regularization in the algorithms. Specifically, these algorithms were designed with word2vec output in mind (typically 300 dimensional vectors with 200000 observations), but your problem has higher dimensional vectors with fewer observations and so would likely require different hyperparameter tuning. IIRC, there's no explicit regularization in these methods, but hyperparameters like stepsize/stepcount can implicitly add L2 regularization, which you probably need for your application.
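For concreteness, the supervised building block that methods like vecmap iterate inside their unsupervised self-learning loop looks roughly like this (a sketch of orthogonal Procrustes, not the actual code from either repository):

```python
# Rough sketch of the orthogonal Procrustes step; in the unsupervised
# setting the row pairing of X and Y is itself induced and refined
# iteratively rather than given.
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# W = procrustes_align(X, Y); then X @ W lives (approximately) in Y's space.
```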
---
PS.
I *strongly dislike* your name of vec2vec. Yours isn't the first/only algorithm that takes vectors as input and produces vectors as output, and you have no right to claim such a general title.
---
PPS.
I believe there is a minor typo with footnote 1. The note is "Our code is available on GitHub." but it is attached to the sentence "In practice, it is unrealistic to expect that such a database be available."
by jackpirate
5/21/2025 at 8:42:40 PM
Hey, I appreciate the perspective. We definitely should cite both those papers, and will do so in the next version of our draft. There are a lot of papers in this area, and they're all a few years old now, so you might understand how we missed two of them.
We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables. So some of this is covered. A lot of these methods also require a seed dictionary, which we don't have in our case. That said, you're welcome to take any number of these tools and plug them into our codebase; the results would definitely be interesting, although we expect the adversarial methods would still work best, as they do in the problem settings you mention.
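For anyone who wants to try, a minimal unpaired baseline with POT looks roughly like the sketch below (illustrative Gromov-Wasserstein matching, not the exact code from our experiments):

```python
# Illustrative sketch of an unpaired optimal-transport baseline with POT
# (pip install pot); not the exact configuration used in the paper.
# Gromov-Wasserstein compares only intra-space geometry, so it needs no
# seed dictionary and no shared dimensionality.
import ot

def gw_soft_match(X, Y):
    """X: (n, d1), Y: (m, d2) arrays of embeddings of (possibly different) texts."""
    C1 = ot.dist(X, X)                      # pairwise distances within space A
    C2 = ot.dist(Y, Y)                      # pairwise distances within space B
    p, q = ot.unif(len(X)), ot.unif(len(Y)) # uniform weights on each point cloud
    # T[i, j] is the mass transported between X[i] and Y[j]
    T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
    return T
```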
As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
by jxmorris12
5/21/2025 at 10:57:12 PM
> We tested all of the methods in the Python Optimal Transport package (https://pythonot.github.io/) and reported the max in most of our tables.
Sorry if I'm being obtuse, but I don't see any mention of the POT package in your paper, or of what specific algorithms you used from it to compare against. My best guess is that you used a linear map similar to the example at <https://pythonot.github.io/auto_examples/domain-adaptation/p...>. The methods I mentioned are also linear, but contain a number of additional tricks that result in much better performance than a standard L2 loss, and so I would expect those methods to outperform your OT baseline.
> As for the name – the paper you recommend is called 'vecmap' which seems equally general, doesn't it? Google shows me there are others who have developed their own 'vec2vec'. There is a lot of repetition in AI these days, so collisions happen.
But both of those papers are about generic vector alignment, so the generality of the name makes sense. Your contribution here seems specifically about the LLM use case, and so a name that implies the LLM use case would be preferable.
I do agree though that in general naming is hard and I don't have a better name to suggest. I also agree that there's lots of related papers, and you can't cite/discuss them all reasonably.
And I don't mean to be overly critical... the application to LLMs is definitely cool. I wouldn't have read the paper and written up my critiques if I didn't overall like it :)
by jackpirate
5/21/2025 at 8:37:27 PM
Naming things is hard. Note that the two alternative approaches you referenced are called "vecmap" and "alignment"; the objection that they "aren't the first/only algorithm for ..." and "have no right to claim such a general title" could just as easily apply there.
by newfocogi
5/21/2025 at 11:01:31 PM
Except those papers are 8ish years old; they actually were among the first 2-3 algorithms for this task; and they studied the fully general vector space alignment problem. But I agree that naming things is hard, and I don't have a better name.
by jackpirate
5/21/2025 at 10:53:27 PM
> I strongly dislike your name of vec2vec.
Imagine having more than a passing understanding of philosophy, and then reading almost any major computer science paper. By this "no right to claim" logic, I'd have you all on trial.
by mjburgess
5/22/2025 at 12:49:32 AM
The problem solved in this paper is strictly harder than alignment. Alignment works with multiple, unmatched representations of the same inputs (e.g., different embeddings of the same words). The goal is to match them up.
The goal here is harder: given an embedding of an unknown text in one space, generate a vector in another space that's close to the embedding of the same text -- but, unlike in the word alignment problem, the texts are not known in advance.
Neither unsupervised transport nor optimal alignment can solve this problem. Their input sets must be embeddings of the same texts; the input sets here are embeddings of different texts.
FWIW, this is all explained in the paper, even in the abstract. The comparisons with optimal assignment explicitly note that it is an idealized pseudo-baseline, and in reality OA cannot be used for embedding translation (as opposed to matching, alignment, correspondence, etc.).
by austinpilot
5/21/2025 at 8:17:12 PM
Hooray, finally we are getting the geometric analysis of embedding spaces we need. Information geometry and differential geometry are finally getting their moment in the sun!
by nimish
5/21/2025 at 7:52:10 PM
I must admit reading the abstract made me think to myself that I should read the paper in skeptical mode.
Does this extend to being able to analytically determine which concepts are encodable in one embedding but not another? An embedding from a deft tiny stories LLM presumably cannot encode concepts about RNA replication.
Assuming that is true. If you can detect when you are trying to put a square peg into a round hole, does this mean you have the ability to remove square holes from a system?
by Lerc
5/21/2025 at 7:59:22 PM
Very fair!
> Does this extend to being able to analytically determine which concepts are encodable in one embedding but not another? An embedding from a deft tiny stories LLM presumably cannot encode concepts about RNA replication.
Yeah, this is a great point. We're mostly building off of this prior work on the Platonic Representation Hypothesis (https://arxiv.org/abs/2405.07987). I think our findings go so far as to apply to large-enough models that are well-enough trained on The Internet. So, text and images. Maybe audio, too, if the audio is scraped from the Internet.
So I don't think your tinystories example qualifies for the PRH, since it's not enough data and it's not representative of the whole Internet. And RNA data is (I would guess) something very different altogether.
> Assuming that is true. If you can detect when you are trying to put a square peg into a round hole, does this mean you have the ability to remove square holes from a system?
Not sure I follow this part.
by jxmorris12
5/21/2025 at 8:22:38 PM
> So I don't think your tinystories example qualifies for the PRH, since it's not enough data and it's not representative of the whole Internet. And RNA data is (I would guess) something very different altogether.
My thought there was that you'd be comparing tinystories to a model that trained on the entire internet. The RNA-related information would be a subset of the second representation that has no comparable encoding in the tinystories space. Can you detect that? If both models have to be of sufficient scale for this to work, the question becomes "what is the scale, and is it sliding or a threshold?"
>> Assuming that is true. If you can detect when you are trying to put a square peg into a round hole, does this mean you have the ability to remove square holes from a system?
>Not sure I follow this part.
Perhaps the metaphor doesn't work so well. If you can detect that something is encodable in one embedding model but not another, can you then leverage that detection ability to modify an embedding model so that it cannot represent an idea?
by Lerc
5/21/2025 at 10:43:13 PM
As I read the paper, you would be able to detect it in a couple of ways:
1. possibly high loss where the models don't have compatible embedding concepts
2. given a sufficient "sample" of vectors from each space, projecting them to the same backbone would show clusters where they have mismatched concepts
It's not obvious to me how you'd use either of those to tweak the vector space of one to not represent some concept, though.
But if you just wanted to make an embedding that is unable to represent some concept, presumably you could already do that by training disjoint "unrepresentable concepts" to collapse to a single point.
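Point 2 above might look roughly like this in practice (a rough sketch of my own, nothing from the paper; Za and Zb are hypothetical projections of each model's vectors into the shared backbone space):

```python
# Clusters dominated by a single model hint at concepts the other model
# has no nearby representation for. Purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

def flag_one_sided_clusters(Za, Zb, n_clusters=50, threshold=0.9):
    Z = np.vstack([Za, Zb])
    source = np.array([0] * len(Za) + [1] * len(Zb))   # which model each vector came from
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
    flagged = []
    for c in range(n_clusters):
        mask = labels == c
        if not mask.any():
            continue
        frac_a = (source[mask] == 0).mean()
        if frac_a > threshold or frac_a < 1 - threshold:
            flagged.append(c)                           # cluster dominated by one model
    return flagged
```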
by eximius
5/21/2025 at 9:55:20 PM
I read a lot of AI papers on arxiv, and it's been a while since I read one where the first line of the abstract had me scoffing and done.
> We introduce the FIRST method for translating text embeddings from one vector space to another without any paired data
(emphasis mine)
Nope. I'm not gonna do a literature search for you right now and find the references, but this is certainly not the first attempt to do unsupervised alignment of embeddings, text or otherwise. People were doing this back in ~2016.
by oofbey
5/22/2025 at 12:52:12 AM
There has been plenty of work on alignment of embeddings, a lot of it cited in the paper. This paper solves the problem of translation, where (unlike in word alignment) there isn't already a set of candidate vectors in the target embedding space. It's generation, not matching.
by austinpilot
5/22/2025 at 8:03:30 AM
Thank you for sharing! I have a question about embedding versioning/migration. I'm not sure if this research solves it?
Say I want to build an app with embedding/vector search. Currently, my embeddings are generated by model A, which is not open source. Later, I find a better embedding model B, and my new data will use model B. Since A and B are two different vector spaces, how can I migrate A to B, or how can I make vector search work without migrating A to B?
Can your research solve this problem? Also, if all embedding models are the same, is there any point in upgrading the model at all? Some must be better trained than others?
by billconan
5/21/2025 at 9:05:31 PM
Does this result imply that if we had an LLM trained on a very large volume of only English data, and one trained only on a very large volume of data in another language, your technique could be used to translate between the two languages? Pretty cool. If we somehow came across a huge volume of text in an alien language, your technique could potentially translate their language into ours (although maybe the same could be achieved just by training a single LLM on both languages?).
by logicchains
5/22/2025 at 1:35:35 AM
> (although maybe the same could be achieved just by training a single LLM on both languages?)
Intuitively I assume this would work even better.
by cubefox
5/21/2025 at 7:53:05 PM
Doesn't the space of embeddings have some symmetries that, when applied, do not change the output sequence?
For example, a global rotation that does not change embedded vector x embedded vector dot products, and changes query vector x embedded vector dot products in an equivariant way.
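To make concrete what I mean, a toy check (my own sketch, nothing from the paper): a random orthogonal rotation leaves every pairwise dot product between embeddings unchanged, so retrieval built on those dot products behaves identically.

```python
# Toy illustration: rotating all embeddings by the same orthogonal Q
# preserves the Gram matrix of pairwise dot products.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))                      # toy "embeddings"
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # random orthogonal matrix

print(np.allclose(X @ X.T, (X @ Q) @ (X @ Q).T))  # True: dot products unchanged
```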
by srean
5/21/2025 at 7:57:30 PM
Yes. So the idea was that an orthogonal rotation will 'encrypt' the embeddings without affecting performance, since orthogonality preserves cosine similarity. It's a good idea, but we can un-rotate the embeddings using our GAN.
by jxmorris12
5/21/2025 at 8:00:58 PM
I can understand that two relatively rotated embeddings from the same or similar dataset can be realigned as long as they don't have internal geometric symmetries. The same way we can re-align two globes -- look for matching shapes, continents.
EDIT: Perfect symmetries, for example featureless spheres or the analogues of platonic solids, would break this. If the embedded space has no geometric symmetries you would be in business.
Re-aligning, essentially would be akin to solving a graph-isomorphism problem.
A Lie-algebraic formulation would make it less generic than an arbitrary graph-isomorphism problem, essentially reducing it to a high-dimensional Procrustes problem. Generic graph isomorphism can be quite a challenge.
https://en.m.wikipedia.org/wiki/Procrustes_analysis
EDIT: Sinkhorn balancing over a set of points (say a d-dimensional tetrahedron, essentially a simplex) furthest from each other might be a good first cut to try. You might have already done so, I haven't read your paper yet.
by srean
5/21/2025 at 8:04:39 PM
Right, that's why the baselines here come from the land of Optimal Transport, which looks at the world through isomorphisms, exactly as you've suggested.
The GAN works way better than traditional OT methods though. I really don't know why; this is the part that feels like magic to me.
by jxmorris12
5/21/2025 at 8:09:58 PM
Got you. I can understand that this has a chance of working if the embeddings have converged to their global optimum. Otherwise all bets ought to be off.
All the best.
I can totally understand the professor's point: a little bit of alignment data ought to significantly increase the chance of success. Otherwise it will have to rely on these small deviations from symmetry to anchor the orientation.
by srean
5/21/2025 at 8:16:49 PM
Yeah, we didn't get around to testing what the impact would be of having a small amount of aligned data. I've seen other papers asserting that as few as five pairs can go a long way.
by jxmorris12
5/21/2025 at 9:12:26 PM
Don't you mean "John" instead of "Jack"? :)
by chompychop
5/21/2025 at 8:25:18 PM
I am a curious amateur, so I may say something dumb, but: suppose you take a number of smaller embedding models, and one more advanced embedding model. Suppose that, for a document, you convert each model's embeddings to their universal embedding representation and examine the universal embedding spaces.
On a per-document basis, would the universal embeddings of the smaller (less performant) models cluster around the better model's universal embedding, in a way suggestive that they are each targeting the 'true' embedding space, but with additional error/noise?
If so, can averaging the universal embeddings from a collection of smaller models effectively approximate the universal embedding space of the stronger model? Could you then use your "averaged universal embeddings" as a target to train a new embedding model?
by SubiculumCode
5/22/2025 at 1:24:15 AM
> we can decode any embedding vectors, given we have at least a few thousand of them.
Do you think that could be sufficient to translate the Voynich manuscript?
by cubefox