6/5/2026 at 12:22:30 PM
I would like to see a deeper comparison with alternatives like rtk, which are already fast and written in Rust. Also, previous comments mentioned a known problem with rtk: it sometimes strips the thing that the LLM needs (or expects), causing more work to happen, not less.
by alex7o
6/5/2026 at 3:14:42 PM
None of these tools measure how effective they are...
It's a massive red flag to me when you could get decent data to see if your thing actually works, and they don't even attempt to...
Have the LLM use your tool, run it on several of the coding benchmarks. If you're stingy, run it on the ones that don't cost much.
Otherwise, I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness.
AFAIK, none of the major players do. That's a sign to me these don't work in general.
I've tried building some tools specific to bug fixing. Intelligently feeding context massively helps smaller models. But what I've found - surprisingly - is that a smaller, much better focused context, even one that includes a lot of helpful data, has almost no impact on larger models compared to what they do by default.
You do save some tokens, though, which is what they're claiming - but not ~99%...
by onlyrealcuzzo
6/5/2026 at 10:23:45 PM
> otherwise, popular solutions would integrate the idea
None of the major players are incentivized to care about this, especially not over other opportunities. Why would you expect them to integrate it?
One of the biggest wins you can institute for your own codebase if you use agents is writing your own harness, by a huge margin. The defaults are fine, but you can do better.
by hansvm
6/6/2026 at 5:25:05 PM
They're incentivised because they're offering plans at a loss and/or pricing out potential customers. All these LLM companies are competing on accuracy and price.
by Cpoll
6/6/2026 at 12:23:04 AM
> The defaults are fine, but you can do better.
Why can I do better than Pi?
I don't want to build my own harness and deal with the bugs... I want to build my project...
My understanding is that Codex / Claude / Gemini subscriptions don't work with custom harnesses.
It's pretty hard to beat 5x more usage if you have the $200/mo subscription by using the API instead.
by onlyrealcuzzo
6/6/2026 at 10:13:17 AM
If you're looking for an efficiency-focused harness, I had a pretty good time using the Dirac agent. The line-based anchors were slightly buggy though (this was a couple of months ago) and would sometimes add the same line of code multiple times or leave an anchor in the output.
by smallerize
6/6/2026 at 12:31:35 AM
Codex definitely does and Claude Max definitely doesn’t.
by hboon
6/5/2026 at 6:46:22 PM
I don't think frontier model providers are going to be incentivized to invest in this much, yet. Once inference gets more competitive, sure. I haven't looked lately, but I wouldn't be surprised if tools like OpenCode do do what you're suggesting, though. Third-party coding harnesses ARE aligned to deliver this type of feature and optimization.
by taude
6/5/2026 at 3:56:52 PM
It's too hard to define what "works" even means in this case. Look at the example savings output. A lot of it is kubectl output.
Your suggestion to use coding benchmarks doesn't really capture the whole picture. I haven't seen a benchmark using kubectl.
> AFAIK, none of the major players do. That's a sign to me these don't work in general.
It's a lose/lose for major players. If it works well, it will lower their revenue. Also there's a high risk it'll significantly worsen results for some people, even if it improves results for others.
by doix
6/6/2026 at 2:46:42 AM
So often we will burn 20% of our limit in a single ill-conceived agent tool call that we're simply not going to be able to, or want to, intercept. Where I see a tool like this being a real step forward is in adding a decision point. It does not have to bubble up to hard-requiring the user to provide permission, but it can give the LLM an intermediate checkpoint that says it's about to get blasted with 30k tokens, here is roughly the shape of it, and do you want to adjust or whittle it down if you know what you're looking for?
There is definitely tons of value to extract from this line of thinking.
by unphased
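A minimal sketch of the checkpoint idea described above, not taken from any tool in this thread. The function name `preview_large_output`, the returned dictionary shape, and the rough 4-characters-per-token estimate are all assumptions for illustration; a real harness would use the model's own tokenizer.

```python
def preview_large_output(output: str, token_budget: int = 30_000, sample_lines: int = 3) -> dict:
    """Pass small outputs through; for large ones, return only their rough shape.

    Hypothetical helper: the ~4 chars/token estimate is a crude assumption.
    """
    approx_tokens = len(output) // 4
    if approx_tokens <= token_budget:
        # Small enough: hand the output to the model unchanged.
        return {"action": "pass", "content": output}

    lines = output.splitlines()
    # Too large: describe the shape so the model can decide to narrow the call.
    return {
        "action": "confirm",
        "approx_tokens": approx_tokens,
        "line_count": len(lines),
        "head": lines[:sample_lines],
        "tail": lines[-sample_lines:] if sample_lines > 0 else [],
    }
```

The harness would show the `confirm` result to the model in place of the raw output and let it either proceed or re-run a narrower command.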
6/5/2026 at 4:49:21 PM
> I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness.
VS Code launched it as a feature in their bundled AI functionality last month: https://code.visualstudio.com/updates/v1_121
by no-name-here
6/5/2026 at 6:35:51 PM
Bundling implies interest...
Defaults imply working...
by onlyrealcuzzo
6/5/2026 at 6:23:22 PM
My partial solution to this was to store the full response in a file and prompt the agent to read that if the condensed version had stuff missing.
by irthomasthomas
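A rough sketch of that approach, assuming a simple head/tail condenser; the function name `condense_with_fallback`, the file name `full_output.txt`, and the wording of the note are hypothetical and not from the commenter's setup.

```python
import tempfile
from pathlib import Path
from typing import Optional


def condense_with_fallback(output: str, max_lines: int = 20,
                           spool_dir: Optional[str] = None) -> str:
    """Return a condensed view of `output` and keep the full text on disk."""
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output  # short enough, nothing to condense

    # Save the untouched output so the agent can read it if needed.
    spool = Path(spool_dir or tempfile.mkdtemp(prefix="tool_output_"))
    spool.mkdir(parents=True, exist_ok=True)
    full_path = spool / "full_output.txt"
    full_path.write_text(output, encoding="utf-8")

    # Keep the first and last lines; replace the middle with a pointer.
    head = lines[: max_lines // 2]
    tail = lines[-(max_lines - len(head)):]
    omitted = len(lines) - len(head) - len(tail)
    note = (f"[{omitted} lines omitted; full output saved to {full_path}. "
            "Read that file if something is missing.]")
    return "\n".join(head + [note] + tail)
```

The pointer line is what lets the model recover when the condensed view dropped something it needed.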
6/5/2026 at 9:08:18 PM
This is the reason, when I built a tool in the same space, I chose to benchmark with cost per correct answer.
Reducing tokens and also turns is quite worthless if the LLM doesn’t solve the task you gave it.
by jahala
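One way to compute such a metric, as a sketch: the commenter does not give a definition, so this assumes it means total spend across all runs (including failed ones) divided by the number of correct answers. The `Run` type and the numbers below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Run:
    cost_usd: float  # what this benchmark run cost
    correct: bool    # whether the agent solved the task


def cost_per_correct_answer(runs: Iterable[Run]) -> float:
    """Total spend divided by correct answers (assumed definition)."""
    runs = list(runs)
    total = sum(r.cost_usd for r in runs)
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        return float("inf")  # money spent, nothing solved
    return total / n_correct


# Illustrative numbers only: the filtered runs are cheaper per run
# but solve fewer tasks, so each correct answer costs more.
baseline = [Run(1.0, True), Run(1.0, True), Run(1.0, True), Run(1.0, False)]
filtered = [Run(0.5, True), Run(0.5, False), Run(0.5, False), Run(0.5, False)]
```

With these made-up runs, halving the per-run cost still raises the cost per correct answer, which is the failure mode the metric is meant to expose.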
6/5/2026 at 9:31:39 PM
Did you benchmark the competition, and can we see the results?
by esafak
6/6/2026 at 8:14:59 AM
No, I don't have the funds to benchmark the competition, but would be happy to put the numbers up if any token whales feel like having a go.
by jahala
6/6/2026 at 11:44:18 AM
Oh, that is a nice approach. I wish more benchmarks did cost per successful
by alex7o
6/6/2026 at 12:00:05 AM
The problem with even attempting to develop a tool for the frontier model space is that the cost to run a statistically significant benchmark is almost certainly going to be over $100 - for a single model.
Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.
If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.
And if you wanted to show any broadly applicable improvements to coding, like they all claim, the costs would be in the thousands per release, even for a single model.
And they almost certainly would not be eye popping.
If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.
by onlyrealcuzzo
6/6/2026 at 8:18:28 AM
Yup, this is hitting it on the nose. But despite the cost, the benchmark is the vital ingredient that can't be skipped. Otherwise, you don't know if what you're building is actually helping the agent rather than hindering it.
On the previous large benchmark run, I proved a 40-50% cost reduction per correct answer.
I'm not sure why the vendors aren't using token filtering/compression more in their tooling, but perhaps they don't mind users feeding them more data and using more data.
by jahala
6/5/2026 at 8:46:55 PM
You can't measure effectiveness, because you never know what kind of model will process your prompt. On one request you might get the full model, e.g. Opus, and on another they'll downgrade it to Sonnet or something more basic. I have this with "Opus 4.8" all the time.
by varispeed
6/6/2026 at 12:51:20 PM
[dead]
by poelzi
6/6/2026 at 4:54:18 PM
I have just put the comparison in the repo in case you want to check it out.
by zdkaster
6/5/2026 at 1:18:35 PM
In terms of token-saving performance, it should be on par with rtk since it is basically the same idea. The major difference is that rtk bundles hundreds of filter rules and leaves no room for the user to adjust them without maintaining a user-owned fork or opening a pull request, while lowfat takes the opposite architectural approach: it removes almost all filter logic from the binary and separates user filters into a plugin system.
by zdkaster
6/5/2026 at 2:57:33 PM
Yeah, I use rtk and would love to see a comparison.
by giancarlostoro