Don't use cosine similarity carelessly

1/14/2025 at 11:21:02 PM

If you're using cosine similarity when retrieving for a RAG application, a good approach is to then use a "semantic re-ranker" or "L2 re-ranking model" to re-rank the results to better match the user query.

There's an example in the pgvector-python that uses a cross-encoder model for re-ranking: https://github.com/pgvector/pgvector-python/blob/master/exam...

You can even use a language model for re-ranking, though it may not be as good as a model trained specifically for re-ranking purposes.

In our Azure RAG approaches, we use the AI Search semantic ranker, which uses the same model that Bing uses for re-ranking search results.

by pamelafox

1/14/2025 at 11:23:19 PM

Another tip: do NOT store vector embeddings of nothingness, mostly whitespace, a solid image, etc. We've had a few situations with RAG data stores which accidentally ingested mostly-empty content (either text or image), and those dang vectors matched EVERYTHING. WAs I like to think of it, there's a bit of nothing in everything.. so make sure that if you are storing a vector embedding, there is some amount of signal in that embedding.

by pamelafox

1/15/2025 at 12:15:31 AM

Interesting. A project I worked on (audio recognition for a voice-command system) we ended up going the other way and explicitly adding an encoding of "nothingness" (actually 2, one for "silence" and another for "white noise") and special casing them ("if either 'silence' or 'noise' is in the top 3 matches, ignore the input entirely").

This was to avoid the problem where, when we only had vectors for "valid" sounds and there was an input that didn't match anything in the training set (a foreign language, garbage truck backing up, a dog barking, ...) the model would still return some word as the closest match (there's always a vector that has the highest similarity) and frequently do so with high confidence i.e. even though the actual input didn't actually match anything in the training set, it would be "enough" more like one known vector than any of the others that it would pass most threshold tests, leading to a lot of false positives.

by variaga

1/14/2025 at 11:40:51 PM

That sounds like a problem for the embedding, would you need to renormalise so that low signal inputs could be well represented. A white square and a red square shouldn't be different levels of details. Depending on the purpose of the vector embedding, there should be a difference between images of mostly white pixels and partial images.

Disclaimer, I don't know shit.

by pbhjpbhj

1/14/2025 at 11:43:17 PM

I should clarify that I experienced these issues with text-embedding-ada-002 and the Azure AI vision model (based on Florence). I have not tested many other embedding models to see if they'd have the same issue.

by pamelafox

1/15/2025 at 12:05:18 AM

FWIW I think you're right, we have very different stacks, and I've observed the same thing, with a much clunkier description thank your elegant way of putting it.

I do embeddings on arbitrary websites at runtime, and had a persistent problem with the last chunk of a web page matching more things. In retrospect, its obvious that the smaller the chunk was, the more it was matching everything

Full details: MSMARCO MiniLM L6V3 inferenced using ONNX on iOS/web/android/macos/windows/linux

by refulgentis

1/15/2025 at 5:47:37 AM

You could also work around this by adding a scaling transformation that normalizes and centers (e.g. sklearn StandardScaler) in between the raw embeddings — based on some example data points from your data set. Might introduce some bias, but I’ve found this helpful in some cases with off the shelf embeddings.

by mattvr

1/16/2025 at 1:12:57 AM

Use horrible quality embeddings and get horrible results. No surprise there. ada is obsolete - I would never want to use it.

by OutOfHere

1/15/2025 at 7:44:35 AM

We used to have this problem in AWS Rekognition; a poorly detected face -- e.g. a blurry face in the background -- would hit with high confidence with every other blurry face. We fixed that largely by adding specific tests against this [effectively] null vector. The same will work for text or other image vectors.

by jhy

1/15/2025 at 12:00:53 PM

If you imagine a cartesian coordinate space where your samples are clustered around the origin, then a zero vector will tend to be close to everything because it is the center of the cluster. Which is a different way of saying that there's a bit of nothing in everything I guess :)

by short_sells_poo

1/15/2025 at 2:12:20 AM

Same experience embedding random alphanumeric strings or strings of digits with smaller embedding models—very important to filter those out.

by jsenn

1/15/2025 at 7:27:16 AM

Statistically you want the retriever to be trained for cosine similarity. Vision LLM retriever such as DSE do this correctly. No need for reranker once done.

by pilooch

1/16/2025 at 1:15:31 AM

Precisely. Ranking is a "smell" in this regard. They are using ada embedding which I consider to be of poor quality.

by OutOfHere

1/15/2025 at 8:25:00 AM

I propose a different technique:

- Use a large context LLM.

- Segment documents to 25% of context or alike.

- With RAG, retrieve fragments from all the documents, they do a first pass semantic re-ranking like this, sending to the LLM:

I have a set of documents I can show you to reply the user question "$QUESTION". Please tell me from the title and best matching fragments what document IDs you want to see to better reply:

[Document ID 0]: "Some title / synopsis. From page 100 to 200"

... best matching fragment of document 0...

... second best fragment ...

[Document ID 1]: "Some title / synopsis. From page 200 to 300"

... fragmnets ...

LLM output: show me 3, 5, 13.

New query, with attached the full documents for 75% of context window.

"Based on the attached documents in this chat, reply to $QUESTION".

by antirez

1/15/2025 at 3:46:17 PM

Slow/expensive. Good idea otherwise.

by datadrivenangel

1/15/2025 at 4:37:20 PM

but inference time compute is the new hotness.

by danielmarkbruce

1/22/2025 at 5:43:12 AM

great, would love to see your application, is it on github?

by sharath7693000

1/15/2025 at 2:26:24 AM

So word vectors solve the problem that two words may never appear in the same context, yet can be strongly correlated. "Python" may never be found close to "Ruby", yet "scripting" is likely to be found in both their contexts so the embedding algorithm will ensure that they are close in some vector space. Except it rarely works well because of the curse of dimensionality.

Perhaps one could represent word embeddings as vertices, rather than vectors? Suppose you find "Python" and "scripting" in the same context. You draw a weighted edge between them. If you find the same words again you reduce the weight of the edge. Then to compute the similarity between two words, just compute the weighted shortest path between their vertices. You could extend it to pair-wise sentence similarity using Steiner trees. Of course it would be much slower than cosine similarity, but probably also much more useful.

by bjourne

1/15/2025 at 2:40:43 AM

You might be interested in HippoRAG [1] which takes a graph-based approach similar to what you’re suggesting here.

[1]: https://arxiv.org/abs/2405.14831

by jsenn

1/15/2025 at 1:39:17 PM

This was called ontology or semantic network. See e.g. OpenCyc (although it's rather more elaborate). What you propose is rather different than word embeddings, since it can't compare word features (think: connotations) nor ambiguity, and the way to discover similarities symbolically is not a well-understood problem.

by tgv

1/15/2025 at 8:26:46 AM

Embeddings represent more than P("found in the same context").

It is true that cosine similarity is unhelpful if you expect it to be a distance measure.

[0,0,1] and [0,1,0] are orthogonal (cosine 0) but have euclidean distance √2, and 1/3 of vector elements are identical.

It is better if embeddings encode also angles, absolute and relative distances in some meaningful way. Testing only cosine ignores all distances.

by yobbo

1/16/2025 at 1:18:41 AM

Modern embeddings lie on a hypersphere surface, making euclidean equal to cosine. And if they don't, I probably wouldn't want to use them.

by OutOfHere

1/16/2025 at 6:15:33 AM

True, on a hypersphere cosine and euclidean are equivalent.

But if random embeddings are gaussian, they are distributed on a "cloud" around the hypersphere, so they are not equal.

by yobbo

1/15/2025 at 7:23:10 AM

> In the US, word2vec might tell you espresso and cappuccino are practically identical. It is not a claim you would make in Italy.

True, and quite funny. This is an excellent, well-written and very informative article, but this part is wrongly worded:

> Let's have a task that looks simple, a simple quest from our everyday life: "What did I do with my keys?" [and compare it to other notes using cosine similarity]: "Where did I put my wallet" [=> 0.6], "I left them in my pocket" [=> 0.5]

> The best approach is to directly use LLM query to compare two entries, [along the lines of]: "Is {sentence_a} similar to {sentence_b}?"

(bits in brackets paraphrased for quoting convenience)

This will result in the same, or "worse" result, as any LLM will respond that "Where did I put my wallet" is very similar to "What did I do with my keys?", while "I left them in my pocket" is completely dissimilar.

I'm actually not sure what the author was trying to get at here? You could ask an LLM 'is that sentence a plausible answer to the question' and then it would work; but if you ask for pure 'likeness', it seems that in many cases, LLMs' responses will be close to cosine similarity.

by bambax

1/15/2025 at 9:48:55 AM

Well, "Is {sentence_a} similar to {sentence_b}?" is the correct query when we care about some vague similarity of statements. In this case, we should go with something in the line "Is {answer} a plausible answer to the question {question}".

In any way, I see how the example "Is {sentence_a} similar to {sentence_b}?" breaks the flow. The original example was:

    {question}
    
    # A
    
    {sentence_A}
    
    # B

    {sentence_B}

As I now see, I overzealously simplified that. Thank you for your remark! I edited the article. Let me know if it is clearer for you now.

by stared

1/15/2025 at 10:33:46 AM

I also don’t see the problem, if I were asked to rank the sentences by similarity to the question, I wouldn’t rank a possible answer first. In what way is an answer to a question similar to the question?

by echoangle

1/16/2025 at 4:44:00 PM

I believe the intention here was to highlight a use case where cosine similarity falls short, leading into the next section that introduces alternatives. That said, I would appreciate more detail in the 'Extracting the right features' section, if someone has an example I would love to see it.

by Dewey5001

1/15/2025 at 1:02:00 AM

> So, what can we use instead?

> The most powerful approach

> The best approach is to directly use LLM query to compare two entries.

Cross encoders are a solution I’m quite fond of, high performing and much faster. I recently put an STS cross encoder up on huggingface based on ModernBERT that performs very well.

by deepsquirrelnet

1/15/2025 at 10:46:38 AM

Technically speaking, cross encoders are LLMs - they use the last layer to predict similarity (a single number) rather than the probability of the next token. They are faster than generative models only if they are simpler - otherwise, there is no performance gain (the last layer is negligible). In any case, even the simplest cross-encoders are more computationally intensive than those using a dot product from pre-computed vectors.

That said, for many applications, we may be perfectly fine with some version of a fine-tuned BERT-like model rather than using the newest AGI-like SoTA just to compare if two products are vaguely similar, and it is worth putting the other one in suggestions.

by stared

1/16/2025 at 4:54:43 AM

This is true, and I’ve done quite a bit with static embeddings. You can check out my wordllama project if that’s interesting to you.

https://github.com/dleemiller/WordLlama

There’s also model2vec doing some cool things as well in that area. So it’s cool to see recent progress in 2024/5 on simple static embedding models.

On the computational performance note, the performance of cross encoder I trained using ModernBERT base is on par with the roberta large model, while being about 7-8x faster. Still way more complex than static, but on benchmark datasets, much more capable too.

by deepsquirrelnet

1/15/2025 at 4:02:52 AM

I had to look that up… for others:

An STS cross encoder is a model that uses the CrossEncoder class to predict the semantic similarity between two sentences. STS stands for Semantic Textual Similarity.

by sroussey

1/15/2025 at 2:04:13 AM

Link please?

by staticautomatic

1/15/2025 at 2:07:48 AM

Here you go!

https://huggingface.co/dleemiller/ModernCE-base-sts

There’s also the large model, which performs a bit better.

by deepsquirrelnet

1/15/2025 at 2:54:10 AM

Cross encoders still don’t solve the fundamental problem of defining similarity that the author is referring to.

Frankly, the LLM approach the author talks about in the end doesn’t either. What does “similar” mean here?

Given inputs A, B, and C, you have to decide whether A and B are more similar or A and C are more similar. The algorithm (or architecture, depending on how you look at it) can’t do that for you. Dual encoder, cross encoder, bag of words, it doesn’t matter.

by janalsncm

1/16/2025 at 4:48:14 AM

I think what you’re getting at could be addressed a few way. One is explainability — and with an llm you can ask it to tell you why it chose one or the other.

That’s not practical for a lot of applications, but it can do it.

For the cross encoder I trained, I have a pretty good idea what similar means because I created a semi-synthetic dataset that has variants based on 4 types of similarity.

Perhaps not a perfect solution when you’re really trying to split hairs about what is more similar between texts that are all pretty similar, but not all applications need that level of specificity either.

by deepsquirrelnet

1/15/2025 at 2:04:03 AM

The article is basically saying: if the feature vectors are crypticly encoded, then cosine similarity tells you little.

Cosin similarity of two encrypted images would be useless, unencrypt them, a bit more useful.

The 'strings are not the territory' in other words, the territory is the semantic constructs cryptically encoded into those strings. You want the similarity of constructs, not strings.

by SubiculumCode

1/15/2025 at 5:00:42 AM

I can't see these in this article, at all.

I think what it say is under "Is it the right kind of similarity?" :

> Consider books. > For a literary critic, similarity might mean sharing thematic elements. For a librarian, it's about genre classification. > For a reader, it's about emotions it evokes. For a typesetter, it's page count and format. > Each perspective is valid, yet cosine similarity smashes all these nuanced views into a single number — with confidence and an illusion of objectivity.

by j16sdiz

1/16/2025 at 6:01:04 AM

Yes that's a good way to understand it

by rglynn

1/15/2025 at 12:08:11 AM

Cosine similarity and top-k RAG feel so primitive to me, like we are still in the semantic dark ages.

The article is right to point out that cosine similarity is more of an accidental property of data than anything in most cases (but IIUC there are newer embedding models that are deliberately trained for cosine similarity as a similarity measure). The author's bootstrapping approach is interesting especially because of it's ability to map relations other than the identity, but it seems like more of a computational optimization or shortcut (you could just run inference on the input) than a way to correlate unstructured data.

After trying out some RAG approaches and becoming disillusioned pretty quickly I think we need to solve the problem much deeper by structuring models so that they can perform RAG during training. Prompting typical LLMs with RAG gives them input that is dissimilar from their training data and relies on heuristics (like the data format) and thresholds (like topK) that live outside the model itself. We could probably greatly improve this by having models define the embeddings, formats, and retrieval processes (ie learn its own multi-step or "agentic" RAG while it learns everything else) that best help them model their training data.

I'm not an AI researcher though and I assume the real problem is that getting the right structure to train properly/efficiently is rather difficult.

by weitendorf

1/15/2025 at 5:54:18 PM

The author asks:

> Has the model ever seen cosine similarity?

Yes - most of the time, at least for deep learning based semantic search. E.g. for semantic search of text, the majority are using, SentenceTransformers [1], models which have been trained to use cosine similarity. Or e.g. for vector representations of images, people are using models like CLIP [2], which has again been trained to use cosine similarity. (Cosine similarity being used in the training loss, so the whole model is fundamentally "tuned" for cosine similarity.)

Articles like these cause confusion, e.g. I've come across people saying: "You shouldn't use cosine similarity", when they've seen SentenceTransformers being used, and linking articles like these, when in fact you very much should be using cosine similarity with those models.

[1] https://sbert.net

[2] https://arxiv.org/abs/2103.00020

by montebicyclelo

1/15/2025 at 4:41:33 AM

My chunk rewriting method is to use a LLM to generate a title, summary, keyword list, topic, parent topic, and gp topic. Then I embed the concatenation of all of them instead of just the original chunk. This helps a lot.

One fundamental problem of cosine similarity is that it works on surface level. For example, "5+5" won't embed close to "10". Or "The 5th word of this phrase" won't be similar to "this".

If there is any implicit knowledge it won't be captured by simple cosine similarity, that is why we need to draw out those inplicit deductions before embedding. Hence my approach of pre-embedding expansion of chunk semantic information.

I basically treat text like code, and have to "run the code" to get its meaning unpacked.

by visarga

1/15/2025 at 10:51:39 AM

If you ask, "Is '5+5' similar to '10'?" it depends on which notion of similarity you have - there are multiple differences: different symbols, one is an expression, the other is just a number. But if you ask, "Does '5+5' evaluate to the same number as '10'?" you will likely get what you are looking for.

by stared

1/15/2025 at 5:13:32 AM

How do you contextualize the chunk at re-write time?

by gavmor

1/15/2025 at 5:32:48 AM

the original chunk is most likely stored with it in referential format such as an id in the metadata to pull from a DB or something along those lines. I do exactly what he does aswell and i have an Id metadata value that does exactly that pointing to an id in a DB which holds the text chunks and their respective metadata

by ewild

1/15/2025 at 6:27:17 AM

The original chunk, sure, but what if the original chunk is full of eg pronouns? This is a problem I haven't heard an elegant solution for, although I've seen it done OK.

What I mean is, how can you derive topics from a chunk that refers to them only obliquely?

by gavmor

1/15/2025 at 7:14:57 AM

Before chunking, run coreference resolution to get rid of all of your pronouns and replace them with explicit references. You need to be a bit of careful to ensure you chunk both processed and unprocessed versions in the same places but it’s very doable.

If you haven’t seen it, there’s a lovely overview of the idea in one of the SpaCy blog posts: https://explosion.ai/blog/coref

by gearhart

1/15/2025 at 7:38:55 AM

Oh wow, yes, this is clever!

by gavmor

1/15/2025 at 10:48:06 AM

HyDE is the way to go! Just ask the model to generate a bunch of hypothetical answers to the question in different formats and do similarity on those.

Or even better, as the OP suggests standardise the format of the chunks and generate a hypothetical answer in the same format.

by DigitalNoumena

1/15/2025 at 3:15:36 AM

Just want to say how great I am for calling this out a few months ago https://news.ycombinator.com/context?id=41470605

by anArbitraryOne

1/15/2025 at 10:56:49 AM

It's nice to hear that! And from this thread, it is not us only two—otherwise, the title wouldn't have resonated with the Hacker News community.

This blog post stemmed from my frustration that people use cosine distance without a second thought. In virtually all tutorials on vector databases, cosine distance is treated as if it were some obvious ground truth.

When questioned about cosine similarity, even seasoned data scientists will start talking about "the curse of dimensionality" or some geometric interpretations but forget that (more than often) they work with a hack.

by stared

1/16/2025 at 2:07:53 AM

Your post was much better than my stupid comment, and I like the points you articulated. Cheers.

by anArbitraryOne

1/15/2025 at 10:31:07 AM

You called it! But it is a pattern as old as the hills in the software industry. "Just add an index". "Put it in the cloud" "Do sprints". One size fits all!

by nejsjsjsbsb

1/15/2025 at 1:49:13 PM

That was a helpful list, in your second comment downthread. What are your top 3 metrics that perform the best on the greatest number of those features that make cosine distance perform poorly?

by khafra

1/16/2025 at 2:09:46 AM

Good question. Unfortunately, I'm just a keyboard warrior asshole that bad mouths things without offering solutions

by anArbitraryOne

1/15/2025 at 2:12:10 PM

The real problem with LLMs is that you can't get a probability estimate out of "Is {sentence_a} a plausible answer to {sentence_b}?"

See https://www.sbert.net/examples/applications/cross-encoder/RE...

by PaulHoule

1/15/2025 at 3:48:14 PM

With an open model, you could probably reverse engineer the token probabilities and get that probability estimate.

Something like: "Is {sentence_a} a plausible answer to {sentence_b}? Respond only with a single yes/no token" and then look at the probabilities of those.

by datadrivenangel

1/15/2025 at 5:51:54 PM

If the model is not open turn up the temperature a bit (if the API allows that) and ask the above question multiple times. The less sure the model is the more the answer will vary.

by wongarsu

1/15/2025 at 4:41:54 PM

Absolutely you can. Rip off the last layer, add a regression layer in it's place, fine tune.

by danielmarkbruce

1/16/2025 at 1:23:26 AM

Of course one can just ask the LLM for the output probability. It will give a reasonably calibrated output, typically a multiple of 0.05. I would ask it for an integer percentage though.

by OutOfHere

1/15/2025 at 7:02:24 AM

That's also why HyDE (Hypothetical Document Embeddings) can work better when the context isn't clear. Instead of embedding the user question directly – and risk retrieving chunks that look like the question – you ask a LLM to hallucinate an answer and use that to retrieve relevant chunks. Obviously, the hallucinated bits are never used afterwards.

by cranium

1/15/2025 at 10:05:50 AM

AFAIK retrieving documents that look like the query is more commonly avoided by using a bi-encoder explicitly trained for retrieval, those generally are conditioned to align embeddings of queries to those of relevant documents, with each having a dedicated token marker, something like [QUERY] and [DOC], to make the distinction clear. The strong suit of HyDE seems to be more in working better in settings where the documents and queries you're working with are too niche to be properly understood by a generic retrieval model and you don't have enough concrete retrieval data to fine-tune a specialized model.

by miven

1/15/2025 at 2:16:14 PM

Without concrete examples of this high dimensional roulette gone wrong, this is more of a "potential pitfalls" article that's been dramatically framed as a warning about danger.

Had the article been framed as "Improving RAG Retrieval Beyond Basic Cosine Similarity," its insights would be aligned with its actual content.

by Jimmc414

1/14/2025 at 11:37:22 PM

Typo: "When we with vectors" should be "When we work with vectors" I think.

by abstractbill

1/15/2025 at 8:51:39 AM

Thx, fixed!

by stared

1/15/2025 at 2:48:54 AM

Occasionally I'll forget a famous quote [0] so I'll describe it to an LLM but the LLM is rarely able to find it. I think it's because the description of the quote uses 'like' words, but not the exact words in the quote, so the LLM gets confused and can't find it.

Interestingly, the opposite conclusion is drawn in the TFA (the article says LLMs are quite good at identifying 'like' words, or, at least, better than the cosine method, which admittedly isn't a high bar).

[0] Admittedly, some are a little obscure, but they're in famous publications by famous authors, so I'd have expected an LLM to have 'seen' them before.

by nomilk

1/15/2025 at 8:09:06 AM

That's not how llm training and recall works at all so I'm not surprised you are not getting good results in this way. You would be much better using a conventional search engine or if you want to use an llm, use one with a search tool so it will use the search engine for you.

The problem you're encountering is not the model being unable to determine whether a quote it knows is responsive to your prompt but instead is a problem to do with recall in the model (which is not generally a task it's trained for). So it's not a similarity problem it's a recall problem.

When LLMs are trained on a particular document, they don't save a perfect copy somehow that they can fish out later. They use it to update their weights via backpropogation and are evaluated on their "sentence completion" task during the main phase of training or on a prompt response eval set during instruction fine tuning. Unless your quote is in that set or is part of the eval for the sentence completion task during the main training, there's no reason to suppose the LLM will particularly be able to recall it as it's not being trained to do that.

So what happens instead is the results of training on your quote update the weights in the model and that maybe somehow in some way that is quite mysterious results in some ability to recall it later but it's not a task it's evaluated on or trained for, so it's not surprising it's not great at it and in fact it's a wonder it can do it at all.

p.s. If you want to evaluate whether it is struggling with similarity, look up a quote and ask a model whether or not it's responsive to a given question. I.e. give it a prompt like this

   I want a quote about someone living the highlife during  the 1960s.  Do you think this quote by George Best does the job? “I spent a lot of money on booze, birds, and fast cars. The rest I just squandered.”

by seanhunter

1/15/2025 at 9:34:59 AM

Funny, I've done this a few times with good results. I guess it depends on the quote and the amount of context you give it.

This is also a case where something like perplexity might yield better results because it would try to find authoritative sources and then use the LLM to evaluate what it finds instead of relying on the LLM to have perfect recall for the quote. Which of course can fail in the exact same way my own brain is failing me (mangling words, mixing up people, etc.). It's something that works surprisingly well. I pay for Chat GPT and I don't pay for perplexity. But I find myself using that more and more.

by jillesvangurp

1/16/2025 at 5:56:56 AM

I suppose one shouldn't be surprised that an LLM-adjacent technology is treated like magic.

I wonder though, for cases where you genuinely are trying to match like to like, rather than question to answer, is vector embeddings with cosine similarity still the way to go?

My understanding, as stated in TFA, is that if you put careful thought (and prompt engineering) into the content before vectorisation, you can get quite far with just cosine similarity. But how far has "tool use" come along, could it be better in some scenarios?

by rglynn

1/15/2025 at 3:53:50 PM

I'm a little suspicious of the Isaac Newton example. The values of the better answer are very close, I wonder if the ordering holds up against small rewordings of the prompt?

Another approach if you're working with a local model is to ask for a summary of one word and then work with the resulting logits (wish I could find the article/paper that introduced this). You could compare similarity by just seeing how many shared words are in the top 500 of two queries, for example.

by gsuuon

1/16/2025 at 4:46:56 PM

These are very interesting and well-founded arguments. However, I believe the solution should not be around LLMs. As rightly pointed out in the article, the proposed solution is prohibitively expensive. How could I instead reformulate the approach to align with my specific definition of similarity?

by Dewey5001

1/15/2025 at 4:58:20 AM

In ML everything is a tradeoff. The article strongly suggests using dot product similarity and it's a great metric in some situations, but dot product similarity has some issues too: - not normalized (unlike cosine simularity) - heavily favors large vectors - unbounded output - ...

Basically, do not carelessly use any similarity metric.

by mlepath

1/15/2025 at 6:52:49 AM

Traditional word embeddings (like word2vec) were trained using logistic regression. So probably the closest would be σ(u.v), which is of course nicely bounded.

(The catch is that during training logistic regression is done on the word and context vectors, but they have a high degree of similarity. People would even sum the context vectors and word vectors or train with word and context vectors being the same vectors without much loss.)

by danieldk

1/15/2025 at 5:35:33 PM

In 3D graphics and physics its importance and accuracy are obvious, but I agree that with ML the vector space represents so many different things that using a dot product feels fuzzy. However I barely have any knowledge of ML, so this is just blind intuition of mine based on assumptions.

by harha_

1/15/2025 at 8:09:55 AM

Say I generate embeddings for a bunch of articles. Given the query "articles about San Francisco that don't mention cars" would cosine similarity uprank or downrank the car mentions? Assuming exclusions aren't handled well, what techniques might I use to support them?

by romanhn

1/15/2025 at 12:09:36 PM

It is up for testing, but you likely get the effect of "don't think about a pink elephant." So I guess that for most embedding models, "articles about San Francisco that don't mention cars" are closest to articles about SF that mention cars.

The fundamental issue here is comparing apples to oranges, questions, and answers.

by stared

1/15/2025 at 2:37:47 PM

So is LLM pre/post-processing the best approach here?

by romanhn

1/15/2025 at 8:14:20 AM

I think you have to separate it into negative query and run (negative) rank and combine results yourself.

by mirekrusin

1/15/2025 at 5:34:39 PM

no this wont work. embedding models at the moment are pretty bad with negations

by breadislove

1/15/2025 at 12:35:11 PM

> These vectors are quite long - text-embedding-3-large has up 3072 dimensions - to the point that we can truncate them at a minimal loss of quality.

Would it be beneficial to use dimensionality reduction instead of truncating? Or does “truncation” mean dimensionality reduction in this context?

by perfmode

1/15/2025 at 4:54:26 PM

The way that the embedding is done is using Matryoshka Representation Learning, truncating it allows to compress while losing as little meaning as possible. In some sense it's like dimensionality reduction.

by sc077y

1/15/2025 at 3:37:10 PM

An argument could be made truncation is a sort of random projection, though it probably depends on how the embedding was created, and a more textbook random projection is likely going to be more robust.

by marginalia_nu

1/14/2025 at 11:35:30 PM

Very interesting article. Is there any model that can generate embeddings given a system prompt? This can be useful not only for similarity searching but also for clustering use cases without having to do too much custom work. Essentially, a zero shot embedding model.

by pooyak

1/15/2025 at 12:17:08 AM

There are many embedding models supporting instructions. https://huggingface.co/spaces/mteb/leaderboard

by Loranubi

1/15/2025 at 12:15:26 AM

I may have missed the point of your question, but there are many generators of embeddings:

https://openai.com/index/new-embedding-models-and-api-update...

by petesergeant

1/15/2025 at 3:23:59 AM

By the way, I just wanted to say I really like your post! It’s well-reasoned, clear, and the use of images makes it super easy and enjoyable to read. So visually pleasing!

by Nataliaaaa

1/15/2025 at 1:20:38 AM

> "Is {sentence_a} similar to {sentence_b}?"

I also find this methods powerful. I see more and more software is getting outsourced into LLM judgements/prompts.

by nikolayasdf123

1/15/2025 at 9:22:35 AM

The problem is to scale that properly. If you have millions of documents, that won't scale that well. You are not going to prompt the LLM millions of times, aren't you?

Embedding models usually have fewer parameters than the LLMs, and once we index the documents, their retrieval is also pretty fast. Using LLM as a judge makes sense, but only on a limited scale.

by kacperlukawski

1/15/2025 at 10:29:12 AM

I love the cartoon because the "espresso" is coming from a French press.

by nejsjsjsbsb

1/15/2025 at 10:36:15 AM

According to text-embedding-3-large, "french press" and "espresso" have a cosine similarity of 0.506. More than half, so we are fine, right? Right?

by stared

1/16/2025 at 2:44:09 AM

AI will take all the coding jobs but baristas will be fine.

by nejsjsjsbsb

1/15/2025 at 2:07:45 PM

In general this is a cool article but I worry when TFA observes that cosine similarity and the dot product are the same under certain conditions (specifically that the vectors are unit vectors). It's really important that people who are using these measures understand something as basic as this. It shouldn't need to be said in an article but I feel (as the author obviously also does) that it does need to be said because people are just blindly using it without understanding it at all.

Cosine similarity literally comes from solving the geometric formula for the dot product of two Euclidian vectors to find cos theta, so of course it's the same. That is to say

a . b = |a||b| cos theta

Where a and b are the two vectors and theta is the angle between them. Therefore

cos theta = (a . b)/(|a||b|)

TADA! cosine similarity.[1]

If the vectors are unit vectors (he calls this "normalization" in the article) then |a||b| = 1 so of course cos theta = a . b in that case.

If you don't understand this, I really recommend you invest an afternoon in something like khan academy's "vectors" track from their precalculus syllabus. Understanding the underlying basic maths will really pay off in the long run.[2]

[1] If you have ever been confused by why it's called cosine similarity when the formula doesn't include a cosine that's why. The formula gives you cos theta. You would take the arccosine if you wanted to get theta but if you're just using it for similarity you may as well not bother to compute the angle and just use cos theta itself.

[2] Although ML people are just going to keep on misusing the word "tensor" to refer to a mere multidimensional array. I think that ship has sailed and I just need to give up on that now but there's still hope that people at least understand vectors when they work on this stuff. Here's an amazing explanation of what a tensor actually is for anyone who is interested https://www.youtube.com/watch?v=f5liqUk0ZTw

by seanhunter

1/15/2025 at 9:31:48 PM

A few nitpicks:

First, I am not sure whom you refer to - as (I hope) everyone who uses cosine similarity has seen a . b = |a||b| cos theta. I read its very name, "cosine (of the angle between vectors) used as a similarity measure".

Second, cos theta = (a . b)/(|a||b|) is pretty much how you define the angle between vectors, when working in Hilbert spaces.

Third, you pick a very narrow view of tensor when it is based on spatial coordinates (and so you get covariant and contravariant indices). But even in physics, this notation is broader - e.g. in quantum physics, a state of two-qubit lives in the tensor product space of two single-qubit states. Sure, both in terms of states and operators, you have a notion of covariance and contravariance (bras and kets, respectively). In mathematics, it is even broader - all you need is two vector spaces and ⊗.

In terms of deep learning (at least in most cases), there is no less notion of co- and contravariance. Yet, the tensor product makes sense, as (say) we can have an outer product between the sample and channels. Quite a few operations could be understood that way, e.g., so-called 1x1 convolutions that mix channels but do not do anything spatially and channel-wise.

A few notes here:

https://github.com/stared/thinking-in-tensors-writing-in-pyt...

by stared

1/16/2025 at 5:38:18 AM

That’s an excellent reference. Thank you.

by seanhunter

1/15/2025 at 9:45:48 PM

> Although ML people are just going to keep on misusing the word "tensor" to refer to a mere multidimensional array.

Could you elaborate on the difference?

I was under the impression that beyond the fact that arrays are a computer science concept and tensors are more of a math/physics concept, for all intents and purposes, they are isomorphic.

How is a tensor more than just a multidimensional array?

by billmcneale

1/16/2025 at 5:34:00 AM

Definitely not an expert so I’m on a journey learning this stuff, but as I understand it at the moment, a multidimensional array can represent a tensor, but to be a tensor, a multidimensional array needs the specific additional property that it “transforms like a tensor” that is, that as you apply some transformation to its components, that its basis vectors transform in such a way as to preserve the “meaning” of the tensor. An example will make this clear. Say I am in manhattan and I have a vector (rank 1 tensor) which points from my current position to the top of the empire state building. I can take the components of this vector in cartesian (x,y,z) form and represent it that way as ai + bj + ck where i,j, and k are the Cartesian basis vectors. However I can use another representation if I want to. Like say I transform this vector so I’m using spherical coordinates, the basis vectors will transform using the inverse of whatever transformation I did on the xyz components so the new basis vectors multiplied by the new components will give me the exact same actual vector I had before (ie it will still point from me to the empire state).

by seanhunter

1/16/2025 at 9:41:29 AM

Replying to myself to explain: - The components of the vector (in whatever coordinate system) are simply an array

- The combination of components + basis vectors + operators that transform components and basis vectors in such a way as to preserve their relationship is a tensor

In ML (and computer science more broadly), people often use the word tensor just to mean a multi-dimensional array. ML people do use tensor products etc so they maybe have more justification that some folks for using the word but I'm not 100% convinced. Not an expert as I say.

by seanhunter