Voyage-code-3 | alt.hn

1/14/2025 at 6:40:16 AM

For 2 years I built RAG apps for clients. They are not Code-Assistants but everytime I see code assistants solely relying on embeddings to find the right pieces as context, it feels wrong to me:

Code is very well structured. Based on a starting point (current cursor, current file, or results of an embedding search result) you would probably fair better to traverse the code tree up and down building a tree or using Abstract Syntax Trees (ASTs) as described in this blog post [4]. It's like a tree search in order to get relevant code pieces for a certain task, and it imitates what human coders do. It would integrate well into an agent loop to search relevant code.

Aren't there any open source code assistants and plugins that do this? All I see are embedding searches for the big projects such as cursor, cline or continue.

All I ever found were a few research efforts such as RepoGraph [1], CodeGraph [2] and one one codebase open sourced by Deutsche Telekom called advanced-coding-assistant [3]

1 https://github.com/ozyyshr/RepoGraph

2 https://arxiv.org/abs/2408.13863

3 https://github.com/telekom/advanced-coding-assistant-backend

4 https://cyrilsadovsky.substack.com/p/advanced-coding-chatbot...

by underlines

1/14/2025 at 6:20:44 PM

Absolutely! Code has vastly more useful structure than prose.

Aider exploits all this structure, using a "repository map" [0]. It use tree-sitter to build a call graph of the code base. Then runs a graph optimization on it with respect to the current state of the AI coding chat. This finds the most relevant parts of the code base, which aider shares with the LLM as context.

Your first link to RepoGraph is an adaptation of the actual aider repo map implementation. In their source code [1], they have some acknowledgements to aider and grep-ast (which is part of aider).

[0] https://aider.chat/docs/repomap.html

[1] https://github.com/ozyyshr/RepoGraph/blob/79861642515f0d6b17...

by anotherpaulg

1/14/2025 at 10:48:54 AM

Those are good points, we are currently experimenting with mixing both approaches storing in graph db and vector embeddings. One is giving structured output with querying possibilities (most LLMs are really good at generating Cypher queries) and somewhat general search with embeddings.

by khvirabyan

1/14/2025 at 7:07:37 AM

Definitely, why isn't there a code assistant that just uses the LSP to query the info you need straight from your language-specific tooling.

by littlestymaar

1/14/2025 at 9:50:31 AM

The LSP is limited in scope and doesn't provide access to things like the AST (which can vary by language). If you want to navigate by symbols, that can be done. If you want to know whether a given import is valid, to verify LLM output, that's not possible.

Similarly, you can't use the LSP to determine all valid in-scope objects for an assignment. You can get a hierarchy of symbol information from some servers, allowing selection of particular lexical scopes within the file, but you'll need to perform type analysis yourself to determine which of the available variables could make for a reasonable completion. That type analysis is also a bit tricky because you'll likely need a lot of information about the type hierarchy at that lexical scope-- something you can't get from the LSP.

It might be feasible to edit an open source LSP implementation for your target language to expose the extra information you'd want, but they're relatively heavy pieces of software and, of course, they don't exist for all languages. Compared to the development cost of "just" using embeddings-- it's pretty clear why teams choose embeddings.

Also, if you assume that the performance improvements we've seen in embeddings for retrieval will continue, it makes less sense to invest weeks of time on something that would otherwise improve passively with time.

by popinman322

1/14/2025 at 4:54:11 PM

> The LSP is limited in scope and doesn't provide access to things like the AST (which can vary by language).

Clangd does, which means we could try this out for C++.

There's also tree-sitter, but I assume that's table stakes nowadays. For example, Aider uses it to generate project context ("repo maps")[0].

> If you want to know whether a given import is valid, to verify LLM output, that's not possible.

That's not the biggest problem to be solved, arguably. A wrong import in otherwise correct-ish code is mechanically correctable, even if by user pressing a shortcut in their IDE/LSP-powered editor. We're deep into early R&D here, perfect is the enemy of the good at this stage.

> Similarly, you can't use the LSP to determine all valid in-scope objects for an assignment. You can get a hierarchy of symbol information from some servers, allowing selection of particular lexical scopes within the file, but you'll need to perform type analysis yourself to determine which of the available variables could make for a reasonable completion.

What about asking an LLM? It's not 100% reliable, of course (again: perfect vs. good), but LLMs can guess things that aren't locally obvious even in AST. Like, e.g. "two functions in the current file assign to this_thread::ctx().foo; perhaps this_thread is in global scope, or otherwise accessible to the function I'm working on right now".

I do imagine Cursor, et. al. are experimenting with ad-hoc approaches like that. I know I would, LLMs are cheap enough and fast enough that asking them to build their own context makes sense, if it saves on the amount of time they get the task wrong and require back&forth and reverts and tweaking the prompt.

[0] - https://aider.chat/docs/languages.html#how-to-add-support-fo...

by TeMPOraL

1/14/2025 at 7:11:55 AM

Similarly, Aider sends a repo map (ranked) https://aider.chat/docs/repomap.html

by lexoj

1/14/2025 at 3:20:21 AM

I don’t understand why we have so many companies working on vector databases when the real value is in fine tuned embeddings. Having recently evaluated vector DB’s to handle 100M+ nodes, everything will run you $1 - 8k per month. Yet using embeddings fine tuned on a specific use case, will lower that 10X since you can get away with way coarser vectors.

This makes so much intuitive sense - voyage please release an insurance focused model.

by serjester

1/14/2025 at 4:04:00 AM

Can you explain more about this, or point someone in the right direction with a few resources for reading more about it?

by digdugdirk

1/14/2025 at 8:24:00 AM

Can you recommend any resources/tutorials about fine-tuning embedding models? Which base model did you fine-tune?

by apricot0

1/14/2025 at 10:38:10 AM

What's an insurance focused model in this context? Sorry if it's a stupid question.

by ghxst

1/14/2025 at 10:55:07 PM

The answer is obvious. Vector DBs have clear line of sight. Experimental embeddings do not.

by parentheses

1/14/2025 at 4:30:41 AM

Currently using voyage for my project for the following reasons:

1. They are recommended by Anthropic on their docs

2. They're focused on embeddings as a service, I somehow prefer this other than spread-thin large orgs like OpenAI, GoogleAI, etc.

3. They got good standing on the huggingface mteb leaderboard: https://huggingface.co/spaces/mteb/leaderboard

by uncomplexity_

1/14/2025 at 2:11:48 AM

Their 1024 dimension outperforms their 2048 dimension? What am I missing here.

by qeternity

1/14/2025 at 2:15:52 AM

For float and int8, 1024 does indeed outperform 2048. However, for binary, 2048 outperforms 1024.

For our general-purpose embedding model (voyage-3-large), embedding vectors with 2048 dimensions outperform 1024 across the board: https://blog.voyageai.com/2025/01/07/voyage-3-large/

by fzliu

1/14/2025 at 4:24:13 PM

Do you have any insights into this? Perhaps scaling degradation with Matryoshka learning?

by qeternity

1/14/2025 at 5:18:16 PM

I used the embeddings from Voyage for a project in Swedish this summer.

The neat thing about Voyage was besides the speed of the service.

I think I had 250 million tokens and Voyage was the fastest. It took a couple of days on and off. I believe my napkin calculation showed that OpenAI would have taken months.

by maCDzP

1/14/2025 at 1:49:49 AM

What's their ranking on SWEBench?

by potatoman22

1/14/2025 at 2:08:16 AM

for an embedding model, this is a very very bad question, to the point where i am wondering if you are trolling and i missed the joke

i can be constructive but i dont have space in the margins of this paper

by swyx

1/14/2025 at 2:14:08 AM

Pardon my ignorance, but are you saying that embedding model leaderboards aren't useful? When I look at https://huggingface.co/spaces/mteb/leaderboard I see voyage's embedding model is at the top. Does that mean they have the best model?

by moojacob

1/14/2025 at 2:18:17 AM

Although voyage-3-m-exp is at the top of the leaderboard, I would not use it for production use cases unless your data is extremely similar to "in-domain" MTEB data. Check out this post from Nils (one of the authors of the original MTEB paper) for more info: https://x.com/Nils_Reimers/status/1870812625505849849

voyage-3-large will work better for almost all real-world production use cases.

by fzliu

1/15/2025 at 8:14:21 PM

Which model would you choose?

by moojacob

1/14/2025 at 2:21:57 AM

They did not claim that leaderboards were useless. There’s a big difference between an embedding model leaderboard and the SWEbench leaderboard though, given that they measure completely different categories of models.

by evilduck

1/14/2025 at 5:16:40 PM

Woops. I should have done more reading before posting.

by potatoman22

1/14/2025 at 2:13:29 AM

We have a repurposed version of SWE-bench Lite for retrieval; you can see the numbers in the last column of this spreadsheet: https://docs.google.com/spreadsheets/d/1Q5GDXOXueHuBT9demPrL...

by fzliu

1/14/2025 at 8:38:03 AM

Release the weights or buy an ad. This doesn’t deserve front page.

by zelcon

1/14/2025 at 2:33:41 AM

OpenAI is a big famous company. Why should I trust you guys with code, when I’m too lazy to clean out sensitive stuff?

by doctorpangloss

1/14/2025 at 3:18:00 AM

It’s Fei-Fei and Christopher Ré. These are serious people.

by benreesman

1/14/2025 at 7:53:53 AM

highly doubt either are deeply involved in the day to day lol

by swyx

1/14/2025 at 8:56:28 AM

Have some respect.

by benreesman