Better Models: Worse Tools

7/5/2026 at 12:15:22 AM

This is easily solved with good error messages.

Claude always gets the syntax wrong on my tool calls.

So I did a revolutionary thing and made the error output print helpful guidance on how to correctly call the tool.

The agent tries again and always gets it right. Total time “wasted”: 1-2 seconds. It happens every session, but it only happens once per context window. After that the agent holds on to the lesson.

To do this for your own tool calls, imagine what you’d do in the agent’s place - what info you’d need so you can correct your mistake. Assume the agent wants to achieve the goal so it’ll try again. These are probabilistic systems, so we need to give them an extra loop to get the deterministic bits right.
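
A minimal sketch of that idea in Python, not the commenter's actual code. The `search` tool, its schema, and every function name here are hypothetical; the point is only that any validation failure returns the problem plus a "how to call me" message.

```python
import json

# Hypothetical tool schema: field name -> (type, required, description).
SEARCH_SCHEMA = {
    "query": (str, True, "text to search for"),
    "max_results": (int, False, "maximum number of hits"),
}


def usage(schema: dict) -> str:
    """Render a short 'how to call this tool' message from the schema."""
    lines = ["Correct usage: pass a JSON object with these fields:"]
    for name, (typ, required, desc) in schema.items():
        flag = "required" if required else "optional"
        lines.append(f"  {name} ({typ.__name__}, {flag}): {desc}")
    return "\n".join(lines)


def validate(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is valid."""
    problems = []
    for name, (typ, required, _) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required field '{name}'")
        elif not isinstance(args[name], typ):
            problems.append(f"field '{name}' must be {typ.__name__}")
    for name in args:
        if name not in schema:
            problems.append(f"unknown field '{name}'")
    return problems


def call_search(raw_args: str) -> str:
    """Tool entry point: on any validation failure, explain how to retry."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return f"Error: arguments are not valid JSON ({exc.msg}).\n{usage(SEARCH_SCHEMA)}"
    if not isinstance(args, dict):
        return f"Error: arguments must be a JSON object.\n{usage(SEARCH_SCHEMA)}"
    problems = validate(args, SEARCH_SCHEMA)
    if problems:
        return "Error: " + "; ".join(problems) + ".\n" + usage(SEARCH_SCHEMA)
    return f"searching for {args['query']!r}"
```

The error text is written for whoever retries the call, so it names each problem and then repeats the full contract instead of only reporting that something failed.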

by cadamsdotcom

7/5/2026 at 3:39:12 AM

I've been trying to push for this perspective about the error messages of jj vcs. There's some push back from people that don't perceive that making tools work well with LLMs is also making tools work well with humans. (Obviously there's more nuance to the arguments than this one sided perspective).

by joshka

7/5/2026 at 2:55:05 AM

This will cause an extra round trip to the LLM. Which means more $ spent.

by psadri

7/5/2026 at 3:12:45 AM

Better a round trip than bad/incorrect results. Also, the cache should kick in so the cost will be minimal.

by sdesol

7/5/2026 at 9:09:39 AM

So? What alternative do you suggest? Let the LLM get it wrong forever? Remove the tool? Automatically try to patch the syntax?

Almost no "solutions" in engineering/programming come for free. One way or another, it's all a balancing act between different solutions with different tradeoffs. In this case, another request/response seems preferable to the other tradeoffs.

by embedding-shape

7/5/2026 at 2:36:17 PM

The counterintuitive pattern I see emerging: if you can cleanly determine the intent of the call, you fix the call and prepend informative text to the tool call response, indicating the mistake made and how to avoid it in the future, followed by the actual tool result. In this case you can validate fields and, rather than throw a hard error, determine whether the problem is just an extra field that isn't needed. If so, you correct the call and prepend a corrective note to the tool call response. This saves turns, instructs the model in context so the mistake is less likely to happen later, and helps models that aren't so good at recovering from bad tool calls and staying on their longer-horizon agentic task (most non-OpenAI and non-Anthropic models).
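
A rough sketch of that repair-and-annotate pattern, assuming a hypothetical edit tool whose only fields are `path`, `old_text`, and `new_text`. The names `repair_call` and `run_edit` are made up for illustration.

```python
# Hypothetical schema for an edit tool; all three fields are required.
ALLOWED_FIELDS = {"path", "old_text", "new_text"}


def repair_call(args: dict) -> tuple[dict, str]:
    """Drop unknown fields when intent is clear; return cleaned args and a note."""
    missing = sorted(ALLOWED_FIELDS - set(args))
    if missing:
        # Intent cannot be recovered, so fail hard with guidance instead.
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    extra = sorted(set(args) - ALLOWED_FIELDS)
    cleaned = {k: v for k, v in args.items() if k in ALLOWED_FIELDS}
    note = ""
    if extra:
        note = (
            f"Note: ignored unknown field(s) {', '.join(extra)}. "
            f"This tool accepts only: {', '.join(sorted(ALLOWED_FIELDS))}.\n"
        )
    return cleaned, note


def run_edit(args: dict, apply_edit) -> str:
    """Run the repaired call and prepend the corrective note to its result."""
    cleaned, note = repair_call(args)
    return note + apply_edit(**cleaned)
```

The call still succeeds in a single turn, and the note stays in context as a correction for later calls.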

by leemoore

7/5/2026 at 11:32:40 AM

Maybe I'm dumb but I believe I can think of one totally free solution here..

by beepbooptheory

7/5/2026 at 9:49:55 AM

Pi already emits good error messages; I always see Claude Opus 4.8 correct itself in its next attempt when it gets a tool call wrong.

by euiq

7/5/2026 at 2:11:08 PM

So, is this part of the tool definition, or did you create your own coder agent?

by Shorel

7/5/2026 at 4:49:25 AM

I've built a library that makes creating rich feedback systems easier, check this out:

https://tool2agent.org/

by klntsky

7/5/2026 at 8:16:16 AM

Okay, but I solved this with a print statement.

by cadamsdotcom

7/5/2026 at 9:15:06 AM

My assumption was that it is often not convenient if you have a lot of logic. I used it internally, and the complexity of my use cases was barely enough to justify it too. But I've seen systems where it would definitely be a value unlock if I had to integrate LLM chatbots into them

by klntsky

7/5/2026 at 12:47:07 PM

very cool.

by try-working

7/5/2026 at 1:26:35 AM

So, are you saying that skills are not such a good tool for agents to learn, they still need tool-trial-and-error dance after injecting them? (I'm assuming each tool comes with its own skill.)

by siwatanejo

7/5/2026 at 1:52:55 AM

> they still need tool-trial-and-error dance after injecting them?

It honestly depends on the model. For my pi-brains extension for pi

https://github.com/gitsense/pi-brains

I've found that after the first hook injection they get it. There are occasions when a model forgets, but since everything is driven by hooks, you can inject as often as needed.

The issue with skills is that they are a one-time thing, so you really can't use skills to correct behavioral issues.

by sdesol

7/5/2026 at 8:52:05 AM

Tools come with a tool description in json schema format, but yes your point stands, it is not enough for opus 4.8 which I've also noticed having tool call issues.

by Bolwin

7/5/2026 at 1:40:12 AM

I do not need to waste tokens on skills, I use Claude Code hooks.

Have a look at the TDD guard at https://codeleash.dev - the scripts/tdd_log.py arguments are pretty specific but it also has guidance in CLAUDE.md and lots of helpful error messages.

by cadamsdotcom

7/5/2026 at 4:37:31 AM

May I know when should skills be used over hooks and vice versa?

by 8cvor6j844qw_d6

7/5/2026 at 6:41:28 AM

Hooks provide determinism.

Hooks can run code.

Hook code can be written in advance by the agent, runs in milliseconds, costs zero tokens, and gives the same result every time.

Agents live at the boundary of codification; anything codifiable should be codified rather than run through an unpredictable machine. Hence, use hooks when you want determinism & predictability & certainty.

Examples: your stop hook could run tests against the code that’s just been written. Now, if your agent docs also tell your agent that the stop hook will run tests and there’s no need to run tests itself, then it’ll trigger a stop when it’s done instead of running tests itself. Just be sure to change the exit code to 2 and route the test failure output to stderr so Claude Code will show that output to the agent. Because the stop hook will fail over and over until tests pass, you just created a very simple guard that guarantees tests pass before you see the code - your agent can’t stop working without passing tests!
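
A minimal sketch of such a stop hook in Python. The exit code 2 and stderr convention is taken from the description above; the use of pytest and the function name `stop_hook` are assumptions, so substitute your own test command.

```python
import subprocess
import sys


def stop_hook(test_cmd: list[str]) -> int:
    """Run the test command; return 0 to allow the stop, 2 to block it."""
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return 0
    # Route the failure output to stderr so the harness shows it to the agent.
    sys.stderr.write("Tests failed; fix them before stopping.\n")
    sys.stderr.write(proc.stdout + proc.stderr)
    return 2


# In the hook script itself (assumption: pytest is the project's test runner):
#     sys.exit(stop_hook([sys.executable, "-m", "pytest", "-q"]))
```

Keeping the logic in a function that returns the exit code makes the hook easy to test without actually ending a session.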

by cadamsdotcom

7/5/2026 at 5:57:29 AM

Hooks are for doing AoP-style wrapping of your interactions with the harness. Type /hook on the console to see what is available. Have CC analyze your session and suggest converting part of your workflow to a hook, and then have it test it.

by thx67

7/5/2026 at 5:54:30 AM

This maneuver requires you to anticipate all the edge cases or error messages beforehand which is practically not possible in many situations. The moment something unanticipated happens or the model changes its processing logic, the tool call system stops working just like any other deterministic program or tool.

by pyeri

7/5/2026 at 7:21:48 AM

> This maneuver requires you to anticipate all the edge cases or error messages beforehand which is practically not possible in many situations. The moment something unanticipated happens or the model changes its processing logic, the tool call system stops working just like any other deterministic program or tool.

Not all; error messages are part of UX design, and the user error message should always give an error that indicates what the user can do to fix the problem.

If you cannot open a file for writing, don't just return "error: cannot open MyFile.txt", return "MyFile.txt: permission denied" (so user can request additional permissions from whoever), "MyFile.txt: no space left on device" (so user can free up some space), "Myfile.txt: file exists and is a directory" (So user can retry with a different name, or remove the directory, etc).
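
As an illustration, here is a small Python sketch that turns the underlying `errno` into that kind of actionable message. The hint wording and the names `HINTS` and `describe_open_error` are made up for this example.

```python
import errno

# Map common failure causes to a message that says what to do next.
HINTS = {
    errno.EACCES: "permission denied; request write access or choose another location",
    errno.ENOSPC: "no space left on device; free up some space and retry",
    errno.EISDIR: "file exists and is a directory; use a different name or remove the directory",
}


def describe_open_error(path: str, exc: OSError) -> str:
    """Build 'path: cause; suggested fix', falling back to the OS message."""
    hint = HINTS.get(exc.errno, exc.strerror or "unknown error")
    return f"{path}: {hint}"


def write_text(path: str, data: str) -> str:
    """Write data to path; return 'ok' or an actionable error message."""
    try:
        with open(path, "w") as handle:
            handle.write(data)
    except OSError as exc:
        return describe_open_error(path, exc)
    return "ok"
```

Unmapped errors still surface the operating system's own description, so nothing is hidden when a cause was not anticipated.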

I think what is happening now is that, with so many of the agent-using pool of devs having never shipped to end-users before, they are surprised that their "program" (the tool) is being used wrong by the end-user (the LLM).

Those of us with battle-scars already expect the user to use it wrong and have learned that it's easier to tell the user how to fix the problem than to ask the user to read the manual/do it the correct way.

by lelanthran

7/5/2026 at 10:40:20 AM

So much this. I tell my juniors: To a beginner programmer, errors are 'the end'. They feel they did their best, it is not their fault, and that is the error message they print. Experienced programmers know the user's struggle; for them an error message is 'a beginning', the first step of the user striving to solve the problem. They gave that command and they did not give it to fail. They (the users) still want to reach their goal.

Pro tip: Don't just print the return code, also print the call and its arguments that failed, even without a stack trace.
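
One way to get that behaviour in Python is a small decorator; this is a sketch, and `report_failures` and `divide` are illustrative names only.

```python
import functools


def report_failures(func):
    """Wrap func so any exception names the call and the arguments that failed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            parts = [repr(a) for a in args]
            parts += [f"{key}={value!r}" for key, value in kwargs.items()]
            call = f"{func.__name__}({', '.join(parts)})"
            # Chain the original exception so the stack trace is still available.
            raise RuntimeError(f"{call} failed: {type(exc).__name__}: {exc}") from exc
    return wrapper


@report_failures
def divide(a, b):
    return a / b
```

Calling `divide(1, b=0)` then raises an error whose message begins with `divide(1, b=0) failed: ZeroDivisionError`, which shows the reader exactly what was attempted.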

by jeffreygoesto

7/5/2026 at 8:24:10 AM

Just add a --verbose flag that shows the stacktrace when there is an error. Then add a footer message when an error appears in non-verbose mode that invites the user/agent to use --verbose to get the full picture.

It obviously may end up in thousands of tokens burned through though (you can also fix that adding different levels of verbosity), but hopefully errors are not common.
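
A sketch of that flag using `argparse`, assuming a hypothetical CLI called `mytool`. The function returns the exit code and the stderr text instead of printing, purely to keep the example easy to check.

```python
import argparse
import traceback


def run(argv: list[str], action) -> tuple[int, str]:
    """Run action(); return (exit_code, text_for_stderr)."""
    parser = argparse.ArgumentParser(prog="mytool")  # hypothetical tool name
    parser.add_argument("--verbose", action="store_true",
                        help="show the full stack trace on error")
    opts = parser.parse_args(argv)
    try:
        action()
    except Exception as exc:
        if opts.verbose:
            # Full picture, at the cost of many more tokens.
            return 1, traceback.format_exc()
        # Short message plus a footer inviting the caller to ask for more.
        return 1, f"error: {exc}\n(re-run with --verbose for the full stack trace)\n"
    return 0, ""
```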

by mrbungie

7/5/2026 at 8:18:31 AM

Are we talking about the same thing?

If the agent uses the tool incorrectly, validation fails.

If validation fails for ANY reason, print a message saying “here’s how to use it correctly”.

You don’t need to anticipate every misuse, just validate your inputs.

by cadamsdotcom

7/5/2026 at 5:51:57 AM

same findings here, it'll doom loop without the proper error messaging. really expensive without error logging that gets propagated back to the agent

by StrugglingDev

7/5/2026 at 1:18:42 AM

LSPs and linters serve the same purpose. I use the latter in git hooks.

by esafak

7/4/2026 at 10:19:29 PM

When building agent integration for my serverless backend https://saasufy.com/, I decided to not use MCP but to put curl commands inside skill markdown files instead: https://github.com/Saasufy/skills

The curl command is extremely popular so models seem to be really good at using it.

Also I like that curl uses a bash syntax and my platform requires JSON payloads; it makes the separation clear to the agent. I find it to be very reliable.

by socketcluster

7/4/2026 at 11:18:25 PM

The skills are very readable too, so you get nice documentation for free. At the very least it's human-readable machine instructions.

by gchamonlive

7/4/2026 at 9:18:19 PM

As critical as I am about articles endlessly concerned with the weaknesses of closed-source cloud LLMs, this one is pretty great, and not just because it concerns interactions with Pi, which looks to me like it's going to end up a sort of quasi-reference implementation of an open source harness, and because it has so much useful technical detail.

But:

"Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."

Only implicitly?

Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.

Everyone who tried this in MUD/MUSH/MOO clients ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.

The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.

by dofm

7/5/2026 at 9:38:36 AM

Once models get better, we could avoid paying for a cache read on edit or write calls by having the model assume they succeeded instead of interrupting the stream to get output. We can then just parse the output and, once we encounter such a silent tool call, execute it. With high probability it's correct (GLM in pi had a 95% tool call success rate for me) and we can continue; otherwise we rewind. As a workaround, you don't want to use the provider feature that interrupts the stream after a tool call, but instead parse the reasoning. I tried this in pi and it kind of worked, but the model got confused about whether edits had been applied, and in several runs it either double-checked or used the bash tool instead, negating any possible benefits.

by bazodedo

7/5/2026 at 12:02:04 AM

It's not the failed call that worries me. The call itself was correct, and the only thing off was a couple of invented fields. That makes the runtime feel like part of the model's interface rather than just an implementation detail. Train a model in a forgiving environment and other runtimes end up inheriting its habits.

by aberrahmane_b

7/4/2026 at 9:44:51 PM

> You can ask the model to produce valid JSON

Doesn't always work, for better performance you can kneel and start begging

by wseqyrku

7/5/2026 at 9:53:56 AM

Humiliation-assisted prompting. it's the future.

by automatic6131

7/5/2026 at 7:26:53 AM

Pi is my daily driver. I noticed the same phenomenon, and had Claude analyze all my past transcripts for classes of 'edit' error. Built an extension which patches the edit tool to self-heal on the majority of those kinds of calls. It's not 100%, but it cuts down on the rejections quite a bit and saves a few round trips.

EDIT: It's still quite fascinating seeing the kinds of things the models keep trying to do. It almost seems like when a human has something slightly off with their nervous system. The conscious brain wants to do one thing, but for some reason the signals aren't getting to the hands correctly.

by pugio

7/5/2026 at 11:22:02 AM

There's a spectrum of possible explanations, from "this is a model training artifact which for now they correct via the harness" through to "this is deliberate, and creates a constantly moving target to make third-party harnesses less efficient for lock-in purposes".

I'd not discount the adversarial end of the spectrum.

by mft_

7/4/2026 at 9:29:47 PM

It sounds like harnesses might have to start to have model by model system prompts, though retrying works, I guess. It reminds me of the ancient times when browsers all read HTML and CSS differently, and differently on different devices. In that sense, this is nothing new. I was going to say, at least we don't have different device types, but then, the model still has to output the right variant of `grep` as well.

by lukasco

7/4/2026 at 9:44:19 PM

The problem with hyper targeting harnesses to models is that you end up locking yourself quite quickly into special behaviors of models, and you make your sessions non transferrable. That can be an acceptable trade-off and I know people who do that.

by the_mitsuhiko

7/4/2026 at 9:41:24 PM

The flip side of this is training models to better understand harness interaction, I suppose, which (if I understand it properly and I am in no way sure I do) appears to be what the Qwen AgentWorld model is doing?

by dofm

7/5/2026 at 2:45:38 PM

For the smaller local model I use (DS4 Flash) I just modified the tools to match the common error case of the model. If the mountain won’t come to Mohammed…

by arjie

7/4/2026 at 11:22:12 PM

Surprised models still output tools as text when for ages we’ve been able to constrain the output at the inference engine level and restrict the model to the tools, parameters, etc. that are available.

Edit: found it, it’s called Grammar-Constrained Decoding (GCD)

by aetherspawn

7/5/2026 at 1:56:53 AM

I imagine the challenge comes from recognizing that your model is trying to call a tool before it actually has and only constraining output then. Running a separate pass for an optionally-empty list of tools afterwards may work, but maybe constraining its output like that causes many spurious tool calls.

by jdiff

7/5/2026 at 2:20:46 AM

With some model providers, using json_schema: true (e.g. with_structured_output) does constrain the output.

by miketery

7/5/2026 at 1:39:41 AM

constrained decoding tends to make models dumber - this is why it's rarely used

by CompleteSkeptic

7/5/2026 at 12:38:19 PM

Different but related: When you use a Codex subscription in an agent like Pi or OpenCode, all the requests and tool call execution go through a sandbox owned by Codex app server, and all the tool calls function somewhat differently, and you can't read files outside of the sandbox as easily. It's currently tripping me up a bit when building a model router.

by try-working

7/5/2026 at 12:44:48 PM

> When you use a Codex subscription in an agent like Pi or OpenCode, all the requests and tool call execution go through a sandbox owned by Codex app server

That is not the case. There are some subtle differences between subscription and regular inference API, but not to the degree that behaviors change entirely. In Pi we're doing tests against both API and subscription API regularly to see how they behave.

by the_mitsuhiko

7/5/2026 at 6:29:10 AM

A better solution might be not to constrain the generation, but to remove invalid fields from the tool call in the assistant message. So on the next turn, the model receives chat history which contains its tool call, but without extra arguments. You can do that in OpenAI chat completions / responses, not sure about the Anthropic API.

There is still a downside: sometimes the model really wants to include an additional argument for whatever reason it reasoned towards, and it needs the error message to understand that the argument doesn't exist. Otherwise, if the argument is manually removed from its tool call, the model will think that it accidentally left out the argument, start retrying, and might go into a loop.
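
A sketch of the history-rewriting step, assuming messages in the Chat Completions style where each assistant tool call carries its arguments as a JSON string. The `TOOL_FIELDS` table and the `edit` tool are hypothetical.

```python
import copy
import json

# Hypothetical map of tool name -> argument names the tool actually accepts.
TOOL_FIELDS = {"edit": {"path", "old_text", "new_text"}}


def strip_unknown_args(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with unknown tool-call arguments removed."""
    cleaned = copy.deepcopy(messages)
    for msg in cleaned:
        if msg.get("role") != "assistant":
            continue
        for call in msg.get("tool_calls") or []:
            fn = call["function"]
            allowed = TOOL_FIELDS.get(fn["name"])
            if allowed is None:
                continue  # unknown tool: leave the call untouched
            try:
                args = json.loads(fn["arguments"])
            except json.JSONDecodeError:
                continue  # malformed JSON is a different problem; do not guess
            if isinstance(args, dict):
                kept = {k: v for k, v in args.items() if k in allowed}
                fn["arguments"] = json.dumps(kept)
    return cleaned
```

The function works on a copy, so the caller can still keep the original history for logging or for returning a real error message when silent removal would leave the model confused.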

by big-chungus4

7/5/2026 at 7:44:28 AM

That kills the KV cache unfortunately. In some models (Gemini) it also doesn’t work because there is a signature on the model messages.

by the_mitsuhiko

7/5/2026 at 10:09:57 AM

An LLM can write programs in any programming language it knows about. So how about asking it to write a shell program that does the tool calls on the client?

You might want to run it in some kind of sandbox to prevent the LLM from taking over the world; security is an issue. But apart from that, why not make the LLM write shell programs instead of relying on JSON etc.? Shell scripting is the language for controlling the OS.

by galaxyLogic

7/5/2026 at 10:37:48 AM

Not always shell—Python, often a subset, is common—but a single sandboxed coding-running tool as the way to run all of the other things that would otherwise be their own top-level tools themselves seems an increasingly common approach.

by dragonwriter

7/5/2026 at 8:07:13 PM

I think agents need user accounts on your machine, and we need an easy way to configure the permissions each agent has; then we don't have to worry about sandboxing so much. This approach was pioneered by smartphone OSes like Android.

Once permissions are safe and secure and easy to set up and understand then using shell-programming as a general interface between agents and your PC/OS might be a good option.

Come to think of it I recently read that Microsoft is planning to produce an AI-oriented OS. Maybe agents with user-accounts is something they are aiming for, to solve the problem discussed in the article.

by galaxyLogic

7/4/2026 at 9:59:05 PM

> In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.

Is this still a thing? I thought Anthropic walked back the silent downgrades so now all the different domains downgrade non-silently.

by sestep

7/4/2026 at 10:09:25 PM

Claude Code downgrades loudly but I'm not sure what happens over API or with other harnesses, OpenRouter, etc.

by resonious

7/5/2026 at 1:26:59 AM

If I send an API call specifying model="Fable", is there a world where returning tokens not from Fable is anything but dishonest?

by fragmede

7/5/2026 at 2:11:14 AM

It's been clear for some time that model tool calling is heavily fit to a few common patterns, it's unsurprising that a tool call that looks the same or has the same name, but works differently, is falling back to priors and causing problems.

Things are not quite AGI yet; which is why people are now saying that intelligence is the harness + model, because the harness makes up for limitations in generalization.

by hsaliak

7/5/2026 at 5:53:44 AM

This has been the case since the early days. Aider had a bunch of code to be very forgiving with formatting of tool calls (file editing in particular at first). It's just the nature of the beast. It surprises me that Pi doesn't have a lot of this kind of stuff built in too

by afro88

7/5/2026 at 12:45:46 PM

I want dark window chrome and light contents but browsers seem completely unwilling to let me have this option.

by donatj

7/5/2026 at 3:03:53 PM

If you search "how to turn on dark mode" in chrome, it'll change to dark mode. Or you can change it in settings. For the numerous websites that don't care, extensions exist to force them. Midnight lizard is my favorite.

Edge has it in edge://flags no extension needed. Firefox & brave requires settings and extension like chrome. I haven't used opera or Vivaldi in ages to help with them, but they will have an option because customizability is a key part of their selling points

by eks391

7/4/2026 at 9:21:01 PM

In my harness I implemented apply_patch by just taking unified diffs for patch -p1. I was shocked to see how bad models are at generating them. I started logging diff failures to analyse:

- All models are terrible at generating line numbers for a proper diff, give up on them

- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available

- Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc
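
One way to act on the first finding is to apply each hunk by matching its content and ignoring the `@@` line numbers entirely. This is a sketch of that swapped-in technique, not the commenter's apply_patch; `apply_hunk` is a hypothetical name and it handles a single hunk only.

```python
def apply_hunk(source: str, hunk: str) -> str:
    """Apply one unified-diff hunk by matching its content, not its line numbers."""
    old, new = [], []
    for line in hunk.splitlines():
        if line.startswith("@@") or line.startswith("\\"):
            continue  # skip the line-number header and 'No newline' markers
        tag, text = line[:1], line[1:]
        if tag == "-":
            old.append(text)
        elif tag == "+":
            new.append(text)
        else:  # context line (leading space) or an empty line
            old.append(text)
            new.append(text)
    lines = source.splitlines()
    span = len(old)
    matches = [i for i in range(len(lines) - span + 1) if lines[i:i + span] == old]
    if len(matches) != 1:
        # Refuse to guess: report how many places matched so the caller can retry.
        raise ValueError(f"hunk matched {len(matches)} locations; expected exactly 1")
    start = matches[0]
    return "\n".join(lines[:start] + new + lines[start + span:]) + "\n"
```

Requiring exactly one match trades some flexibility for safety: an ambiguous hunk fails with a message the model can act on, for example by adding more context lines.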

by mappu

7/4/2026 at 10:00:33 PM

In my harness, I implemented tool_edit as a subset of Rob Pike’s Sam editor syntax [0].

Only need ~650 tokens of system prompt for it to work. It’s pretty stellar.

[0] https://9p.io/sys/doc/sam/sam.html

by fractorial

7/5/2026 at 12:15:39 AM

Yep. I spotted the same thing in piclaw (which relies on the pi runtime) but did not have time/energy to do a lot about it—and fable does the same, as far as I can tell, with one out of five or six edits failing. But I prefer OpenAI models for coding, so it wasn’t a real problem.

by rcarmo

7/4/2026 at 9:57:39 PM

This makes sense to me, much as I don't like it. IMHO the strategy taken by StrongDM's attractor coding agent seems like a path of least resistance. Directly target the LLM providers APIs and directly target their default tools.

by _doctor_love

7/5/2026 at 4:57:05 AM

I suspect this isn't a malfunction, but rather a deliberate measure designed to counter so-called distillation attacks.

by zxilly

7/4/2026 at 11:38:37 PM

Does Pi even need read/write/edit tools? Couldn't it just have bash commands and get the model to use e.g. sed for everything?

by xyzsparetimexyz

7/4/2026 at 11:50:09 PM

They do use these tools but they are not as efficient as codex multi-file patch which can perform file move, and edit in a single generation.

by _pdp_

7/5/2026 at 3:41:48 AM

closed source harness + RL fine tuning on customer prompts on said harness is becoming a kind of economic moat (?)

by 33MHz-i486

7/4/2026 at 11:01:32 PM

> [...] newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested edits[] array

> My strongest hypothesis is that this is not random deterioration but a training artifact. [...] Anthropic’s own client appears to expect and accept a fair amount of slop and repairs it, mostly silently

> If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward.

> Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape.

> Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away.

Great article.

Interesting root cause hypothesis. Couldn't one simply strip the slop-handling from the RL env's harness to avoid this though?

I do agree on the walled garden being built here. Proprietary frontier models performing best in proprietary harnesses makes sense for Anthropic's interests.

by wxw

7/5/2026 at 7:30:35 AM

My favorite feature from Claude Code is the "auto" mode to dynamically approve permission queries that are reasonably safe. Unfortunately the standard pi sandbox extension doesn't support it. Pi should really build permissions into the agent level.

https://github.com/carderne/pi-sandbox

by Onavo

7/5/2026 at 12:48:29 AM

I guess we are going to get even more of this, where models and tools start producing nonsensical results, no one understands why it happens, and we must read articles like this that catch it.

by rvz

7/5/2026 at 3:35:24 AM

We’re entering the era of AI trained by previous generations’ slop. It’s not surprising that it’s sloppy.

by namuol

7/4/2026 at 9:21:17 PM

building deterministic tools on non-determinism is hard enough; try adding another layer where your cloud provider decides to massage the context, realigns its permitted output, arbitrarily downgrades context to cheaper models, or they hire an MBA who determines your plan value can be tied to a degraded model under a new shrinkfied.

It's amazing anyone watched the last 2 decades of tech's enshittification and wants to hook their wagon to this shitshow.

by cyanydeez

7/4/2026 at 9:32:58 PM

Open source developer surprised and concerned by the trajectory their favorite proprietary software is taking.

by ares623

7/4/2026 at 11:38:13 PM

Hey, an article right up my alley! AI infrastructure/tools engineer here (hic-ai.com); my flagship product, HIC Mouse, is a precision-editing system for coding agents designed to work across a wide array of models and harnesses. Mouse provides 11 tools exposed via MCP for read-, find-, and edit-operations, using a coordinate-based schema (as well as exact and multiple string replacement), a Dialog Box inspect/refine/save/cancel changes functionality controlled by the agent to force staging and review of multi-operation or large edits before changes are written to disk, and extensive agent guidance mechanisms or guardrails to help the agent realize if it's about to do something potentially destructive or overly verbose.

I definitely think models may be trained to use particular popular harnesses or expect certain fields in the editing-tool or other tool schemas. Rather than trying to conform to (or force) one particular format, my approach instead is to design flexibly enough to handle a wide array of possible inputs and tool calls, to help the agent recover whenever its tool calls truly can't be salvaged and have to return errors, and to auto-normalize results whenever reasonable to do so. It really does make a very dramatic difference (I wouldn't have bothered to launch if I thought it wasn't a meaningful advance) but anyway, just wanted to share my perspective given that I live and breathe this problem all day, every day.

by simonreiff

7/5/2026 at 12:06:33 AM

Very cool tool. As the "moar tokens" era is starting to wind down I think people are going to realize just how crappy these harnesses really are, especially Claude Code.

I have gone back and forth between Claude and Cursor and it is clear Claude just throws the kitchen sink at problems to get an edge. I write MCP tools and I see these exact problems when the inputs and outputs aren't clearly defined, the LLM just guesses and retries.

by lubujackson