Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

5/20/2026 at 8:03:29 AM

I like work in this area, and this is really helpful, thanks. I actively avoid cloud based LLMs and mainly use 4b - 30a3b param local models. This means I don't really have a good grasp of SOTA LLM performance or accuracy, but I know what to expect when dealing with local models, and where the pain points are.

I've only skimmed the post and read the abstract and in some places you make a nod to how simple tweaks can make something 10x faster/slower, but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed.

Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing, since agents will often swap prompts and break prompt caching and similar timing improvements.

It looks like your work adds a layers and wrappers like guard rails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add.

I really appreciate and respect the work you've done, and apologies if you have already addressed this head on, but with so little talk about the impact on timing here, I feel like you're hiding something or overinflating the actual real world improvements here - what are your thoughts?

It's also mildly concerning me that nobody else has raised this - am I doing something wrong here, or is everyone else just not actually using local models in real life?! Talk to me about your speed experiences!

by 1dom

5/20/2026 at 12:52:37 PM

"I actively avoid cloud based LLMs… This means I don't really have a good grasp of SOTA LLM performance or accuracy…

…but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed."

I wonder, if you were to use cloud-based LLMs more often, you might find that accuracy (fidelity?) is indeed more more lacking in your local models.

You can always just throw hardware at your speed problems after all.

by JKCalhoun

5/20/2026 at 3:42:33 PM

I agree accuracy isn't maybe the best word here, I used it as it was used in the original post, mainly a as a catchall for "everything but speed", so fidelity, perplexity, etc.

I also agree that if I spent more time using cloud based LLMs, I would very much find local LLMs less capable and useful. Comparison is the thief of joy though, and I'd rather feel blissfully ignorance towards SOTA LLMs rather than a dependence on them.

Before taking a local focus approach, LLMs increasingly left me feeling a mixture of FOMO, sadness and futility towards the future of software and tech. I assume it's 100% a me problem, but it has it's benefits:)

by 1dom

5/21/2026 at 4:53:27 AM

No, I'm a fan of local as well. For me though, there is just such a fascination that I can have something like this sitting on my own hard drive. It's okay that it's not a "frontier model".

by JKCalhoun

5/20/2026 at 12:48:10 PM

Hi! Latency is definitely a factor in any system, and the dashboard and paper do report elapsed time - but at the workflow level.

On a per-call basis, the wrappers are pure python ifs and such, measured in ms easily, and frankly negligible compared to the LLM call itself which will be on the order of magnitude seconds.

Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.

I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.

by zambelli

5/20/2026 at 5:38:53 PM

Hi! Thanks for the response. Like I mentioned, I only skimmed, and it sounds like there's more to it than I understand, so I'll take a deeper look and see how it feels in practice.

> Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.

> I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.

Yeah, that makes sense and seems fair. The sort of delays are almost and inevitability, you're not trying to improve speed, but by improving reliability, it can obviously increase overall throughput.

Having watched the demo video too now, automating retries etc would be helpful for me. It's impressive to see how quick the models run on better hardware, and the performance improvements are impressive, even if the overall run takes longer sometimes because it does more correct things. Thanks again!

by 1dom

5/20/2026 at 2:36:50 PM

> On a per-call basis, the wrappers are pure python ifs and such, measured in ms easily

Ah that's good to know

when I first saw this posted yesterday I was wondering that, kind of assumed maybe it was doing extra LLM calls to make judgements

by anentropic

5/20/2026 at 3:57:49 PM

Retry nudges do generate an extra LLM call, and those average extra calls time impacts are captured in the eval data.

But that's the difference between the call failing and succeeding (eventually).

On successful calls the presence of forge should be unnoticeable.

by zambelli

5/20/2026 at 11:28:41 AM

what does "30a3b" mean?

by NooneAtAll3

5/20/2026 at 11:52:05 AM

Yup, confirming what pamcake said, 30b with 3b active.

I have a laptop with a broken screen and an RTX2060 at my disposal. I can run 12b - 14b dense usably, just, although I think 4b - 8b dense models give me the best tradeoff of speed and usefulness.

Larger MOE models with more parameters (20b+) but fewer active (2 - 3b) are sometimes a little bit slower, but are often far more capable.

by 1dom

5/20/2026 at 11:40:21 AM

Guess: 30B MOA with 3B active

by pamcake

5/20/2026 at 10:03:03 AM

[flagged]

by claud_ia

5/19/2026 at 10:09:42 PM

I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime.

by Escapade5160

5/20/2026 at 5:38:51 AM

The problem is that you get similar quality as if you gave a junior unlimited time to work on a problem and told them to keep trying different things until the goal is reached.

Even the SOTA models have this problem when the work is complicated enough. The problem is amplified more with the small models.

by Aurornis

5/21/2026 at 4:08:43 PM

One important facet of this is it’s not far from “giving unlimited juniors unlimited time…”

Where the limits are set by hardware for agentic execution (compute/network/storage) && inference speed

by coip

5/20/2026 at 12:52:00 PM

There's a lot of valuable things that can be done in that range, especially when token costs aren't a concern. Not every problem requires SOTA

by Zetaphor

5/20/2026 at 1:42:59 PM

> especially when token costs aren't a concern. Not every problem requires SOTA

If token costs aren’t a concern I’m using SOTA for everything.

Even SOTA gets it wrong and hallucinates, but at a lower rate. I don’t want to waste my time.

by Aurornis

5/20/2026 at 4:43:26 PM

I believe they mean token costs aren't a concern when you're not paying for a SOTA model via API, and are instead running local models.

Infinite monkeys on infinite typewriters, and all that.

by lixquid

5/20/2026 at 6:42:42 PM

Correct, I have local hardware, not infinite money.

by Zetaphor

5/19/2026 at 10:12:18 PM

Lol, I love that framing. Yeah, the small models have impressed me a lot during this work. The reasoning can be quite good, and definitely sufficient for a lot of cases. Just gotta nudge em back on track Every now and then and they'll figure it out.

by zambelli

5/19/2026 at 10:28:36 PM

If I understood correctly, the model will get it right because it knows when it isn't right.

by cornholio

5/19/2026 at 10:30:27 PM

Essentially, yes that's right! There's some subtlety in how to let it know it was wrong (returning things as tool errors because it trained on that), but that's the gist of it - sort of a self-correcting architecture.

by zambelli

5/20/2026 at 12:52:18 AM

https://en.wikipedia.org/wiki/Apophatic_theology

by tomjakubowski

5/20/2026 at 2:26:12 AM

I was expecting this https://knowyourmeme.com/memes/the-missile-knows-where-it-is

by jon_richards

5/20/2026 at 2:11:57 PM

the missile knows where it is because it knows where it isn't

by forlorn_mammoth

5/20/2026 at 10:49:38 AM

Prior art: https://ghuntley.com/ralph/

by andai

5/19/2026 at 10:59:57 PM

A thousand monkeys on a thousand typewriters…

by koolba

5/19/2026 at 11:40:45 PM

That is the whole challenge, actually! A new metric I'm going to dogfood into forge is ETTWS - estimated time to working solution.

A simple retry loop around your whole workflow could, in some cases, be all you need. But it could mean many blind attempts to get through a workflow successfully. And hopefully there isn't a payment step partway through!

The fewer hard errors nix the whole workflow, the lower your ETTWS.

by zambelli

5/20/2026 at 6:28:56 AM

Is it strange that I immediately interpreted ETTWS to be Estimated Time To William Shakespeare?

by killing_time

5/20/2026 at 3:32:51 PM

It's relevant to the "thousand monkeys on a thousand typewriters".

by Mithriil

5/21/2026 at 8:42:50 AM

The one true AGI metric!

by jononor

5/20/2026 at 5:22:37 AM

Have you read the MAKER/MDAP paper? 1 million sequential tasks.

by beacon294

5/20/2026 at 5:48:24 AM

No, I haven't - hadn't heard of it. I'll try to squeeze in a quick read in the coming weeks!

by zambelli

5/20/2026 at 2:30:16 AM

This is a thousand unusually smart monkeys who speak every major human language fluently and are proficient in every major programming language, but sometimes still make bizarre mistakes and need to be put back on track.

by DiogenesKynikos

5/20/2026 at 3:01:51 AM

This is fun for you?

by jplusequalt

5/20/2026 at 5:45:11 AM

I found it fun to read.

by bratbag

5/19/2026 at 7:58:44 PM

Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS

by jf

5/19/2026 at 8:04:02 PM

Very tangential! I'll try but it might take me a while.

by zambelli

5/19/2026 at 11:11:11 PM

Who owns the IP for Genera?

by user3939382

5/20/2026 at 4:51:11 AM

John C. Mallery of MIT

by jf

5/20/2026 at 1:23:05 AM

Had a couple thoughts in this realm, and am working them into my own harness. Curious to see what others think. I'm not sure if this is generalizable, as my harness is fairly specialized:

- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.

- The harness then executes the plan in order

- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.

- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.

- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning rewinding the conversation and inserting the reasoning for the retry.

- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.

This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.

by lwansbrough

5/20/2026 at 1:30:03 AM

I went through a similar (in philosophy) exercise with my small-model agentic coding harness - built on forge.

A few things I noticed related to your points: - on conversation rewind, I implemented a similar tool call collapse on the main agent (the one you chat with). Once it was done with a task, the tool call history was collapsed to keep the context clean - it was more about hygiene than size.

- the harness interrogating the model bit is a bit different, I haven't tried that approach. Forge relies on model self-correction in a bid to avoid having bespoke error modes, but I guess if you can abstract and automate the interrogation based on schema or something that could work!

Overall I like the clean conversation history aspect, but I suspect that you might be doing a lot of round trips for tools with many args, versus "letting it fail and giving it one nudge". That being said, it's an interesting idea for harder scenarios/tasks!

by zambelli

5/20/2026 at 3:19:46 AM

I've been writing my own, out of curiosity, with gemma4. I've been surprised how far I'm getting.

by jvalencia

5/20/2026 at 3:35:59 AM

Very cool! Hopefully you'll share it someday!

by zambelli

5/20/2026 at 7:42:09 AM

Yes, I was thinking about the same approach because I have Strix Halo and it slows down with longer context so context with less than <10k tokens would be achievable this way. If this could be done with small model that have >50tk/s that would be huge.

Unfortunately I am caught up right now in other projects at work and otherwise and just tried few dozens of prompts to see if this is even achievable.

by npodbielski

5/20/2026 at 12:13:51 AM

> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.

I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?

That seems like comparing apples to apple pie, there's some ingredients missing.

by seemaze

5/20/2026 at 12:45:33 AM

I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play.

However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile.

The backend affects almost all model families, and was just something I've never seen really talked about.

by zambelli

5/20/2026 at 1:40:03 AM

Do you have any suspicion about what is different between the backends?

That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.

by eob

5/20/2026 at 2:16:11 AM

I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.

I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.

I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.

by zambelli

5/20/2026 at 9:36:32 AM

Hey, this is most probably related to the chat template or the reasoning parser or the tool call parser or also things like kv cache quantization and possibly other params that affect results like the regular top k top p and all of that, the backend often sets its own defaults or the lack of them. It’s best to have all these under control if possible. I wonder regarding this project have you been testing it on real world projects? I’m working on an agentic loop as well also using a local model.

by kosolam

5/20/2026 at 12:57:49 PM

Yes I've now used it "in the wild" for a handful of use-cases. I still run into the backend thing even when declaring params though, which is odd to me. But there might be params not typically passed in with the model that backends are setting. Again, really not my area of expertise.

As for consumers, I've done a home assistant, an agentic coding harness, and an autonomous engineering project (still in flight).

by zambelli

5/20/2026 at 12:15:44 AM

I wouldn't expect such difference

by imachine1980_

5/20/2026 at 7:38:53 AM

Thanks for building what I'd hoped to find the time to build (and much better than what I would have made)! One question: do you think there is room for parallelization here, eg in the retry loop? Local models generally can handle a limited number (~ 2 digits) of concurrent requests pretty well, even on consumer hardware, which can give >10x boosts in the effective number of token/s. I've been thinking for a while about workflows that could take advantage of this, and 'fix this error' could be one (if not ideal) application. Would be curious what you think.

by c7b

5/20/2026 at 12:55:33 PM

Interesting - so you're thinking give the model two parallel shots at the tool call and take the winner if there is one, or fallback to retry if not?

That would certainly work in theory, but I'm not as familiar with parallel calls.

- If you mean the model calls the tool twice, identically, in a batch call - that would work fine and Forge handles batch calls, but many small models wouldn't think to do that so you'd have to explicitly prompt it to do so.

- If you mean ask the LLM twice to call the tool and look at both answers, my only concern would be latency from doing 2 calls instead of 1.

- Unless you're truly running 2 instances of the model and aren't memory-bandwidth bound, then yes running parallel workflows would likely help. Especially if you could have them compare notes at certain steps or something.

But I haven't explored this much at all so if you're thinking of something else, let me know!

by zambelli

5/20/2026 at 3:33:37 PM

So as an oversimplified PoC, I get:

    llama-parallel -m ~/models/Qwen3.5-4B-Q8_0.gguf -ns 4 -p "Fix this Python code, answer with code only: prnt('Hello World)" -pps

    llama_perf_context_print:        load time =    1181.90 ms
    llama_perf_context_print: prompt eval time =     190.57 ms /   374 tokens (    0.51 ms per token,  1962.49 tokens per second)
    llama_perf_context_print:        eval time =    3612.25 ms /   159 runs   (   22.72 ms per token,    44.02 tokens per second)
    llama_perf_context_print:       total time =    4302.84 ms /   533 tokens
    llama_perf_context_print:    graphs reused =        155

and four answers (3 of which are immediately usable), with -ns 1 I get :

    llama_perf_context_print:        load time =    1185.61 ms
    llama_perf_context_print: prompt eval time =     187.55 ms /   305 tokens (    0.61 ms per token,  1626.27 tokens per second)
    llama_perf_context_print:        eval time =     158.92 ms /     7 runs   (   22.70 ms per token,    44.05 tokens per second)
    llama_perf_context_print:       total time =     468.85 ms /   312 tokens
    llama_perf_context_print:    graphs reused =          6

Now this is probably not the right way to use it, you should probably also use vLLM instead and it's also not a good model to use for this. But there is a real effect here that others have demonstrated, that the GPU is apparently not always maxed out while handling a single request, so sending concurrent requests can yield substantial parallelization benefits. The idea with this application would be something like this: send off the same query in parallel requests, triggering parallel tool calls, and then filter the results (filter out all failing ones, rank the rest by some simple metric of code complexity). There are probably better applications as well, I'm basically just thinking what kinds of tasks could benefit from parallelization.

by c7b

5/20/2026 at 5:26:11 AM

This is fantastic. I haven't got any local inference as I can't afford it right now, but tool calling has been a concern for me with these smaller models through OpenRouter.

I've been working on a pytest-first acceptance testing framework called Dokimasia (do-kee-ma-see-ah) that I'd love to get your thoughts on: https://github.com/deevus/dokimasia

Acceptance testing might not be what you need for Forge, but since you're deep in AI tool building I thought you may have opinions.

by deevus

5/20/2026 at 5:38:01 AM

Oh, interesting idea. Formalizing an abstraction layer for testing all the integration types out there in the AI ether, essentially? MCP, skills, etc.

I think this sits a level higher than Forge - maybe testing the workflow proper and integration points that it might surface (if some tools are giving access to an MCP or something).

Could likely layer both together without much trouble.

Only thing I'd be curious about is how you handle the non-deterministic nature of these models. Sometimes they get the tool call right, sometimes they barf bad json. Does the suite run multiple trials?

by zambelli

5/20/2026 at 5:49:18 AM

[dead]

by deevus

5/20/2026 at 2:27:49 PM

If anyone else couldn’t find the working paper link (the readme and conf link didn’t work for me) it’s this one here: https://github.com/antoinezambelli/forge/blob/main/docs/forg...

by faizshah

5/20/2026 at 2:50:18 PM

Thank you! I've been trying to catch those replies and redirect people, but hopefully your comment be upvoted for others. Very embarrassing to put up the post with the wrong link lol.

by zambelli

5/19/2026 at 9:30:25 PM

Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !

by 6r17

5/20/2026 at 12:26:55 PM

A while back when the latest Big Model came out, very impressive benchmarks, I tested it on some coding tasks.

I gave it 3 simple changes to make. It did it perfectly.

Then I tried with a much smaller model. It also did it perfectly, except 3x faster and 9x cheaper.

I used to think "best model" was what's at the top of the benchmarks, but for most tasks that just means you're going to wait longer and pay more money. The right model depends on the job.

(Also, speed itself is a feature -- when you get the really fast models, it enables a kind of real-time interactive usage that is otherwise not possible in the "alt tab and hope it's done" workflow.)

by andai

5/20/2026 at 1:01:25 PM

Definitely! A lot of tasks are within reach of small models, much more than people would think. Big models still shine in vague contexts or for breadth, or for very long running tasks, but yeah. The small ones just need help on longer multi-step workflows.

What small models have you used most/found most stable?

by zambelli

5/21/2026 at 2:21:48 PM

[flagged]

by RichardFF991

5/20/2026 at 7:48:23 AM

Hey this genuinely _fucks_, you're a legend. You can even get stupid good results from the 1 bit bonsai models! Plays v nice with lmstudio

It's now completely reasonable to throw a 7900XTX in a spare rig, put it in the basement, give it an absurd goal, and forget about it.

by monster_truck

5/20/2026 at 12:59:42 PM

Thanks! Did you try it with lmstudio? I actually never tried it with that. Only published ollama, llamfile, llama.cpp native/prompt - and unofficially tested vLLM, but never lmstudio.

by zambelli

5/21/2026 at 1:13:42 AM

Yessir! Been a longtime fan of it, I've spent too many fuckin years wrangling python, especially pytorch, especially on AMD, dep issues for fun and profit... they don't get enough flowers. It's oai compat, no thorns.

by monster_truck

5/20/2026 at 4:20:40 PM

I couldn't get it working with lmstudio. I have bonsai-8B running with llama-cpp and am attempting to build a harness for it. Looking good so far, I just got it started but Forge made tool calling work pretty quickly!

by tmjdev

5/20/2026 at 5:58:16 PM

Very cool! I'll try to get an issue open on lmstudio support and add it to the backlog.

by zambelli

5/20/2026 at 7:39:26 PM

Maybe you can update the guide and tell us how to use vLLM with Forge when you find some time?

by DeathArrow

5/20/2026 at 7:52:14 PM

[dead]

by zambelli

5/19/2026 at 10:53:59 PM

Something very similar I was experimenting with on, but had different results that you may be interested in, some of my findings were interesting

This was part of testing out how well a tool of mine worked (github.com/jsuppe/loom), which aims to be used to extracts requirements, specs, creates tests. At first I had no intention of using it for code generation but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then providing work to a local ollama instance running one of several models. Not all local models had the same outcome, not all coding languages had the same outcome. I also found in this experiment, when nailing down the coding tasks I wanted to set up positive and negative scenarios- which is where I found setting guardrails can sometimes backfire with inversion- this essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion

For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.

by 88j88

5/19/2026 at 11:18:33 PM

I've had a few reversions as well along the way, including in upcoming v0.7.0 patch. Some models benefitted, others regressed - overall better on harder scenarios or I wouldn't be releasing, but yeah - not intuitive.

The biggest challenge has been balancing the desire to hyper optimize for my favorite models, versus average behavior, versus consumer needs.

by zambelli

5/19/2026 at 8:32:01 PM

Impressive work, love seeing tools that boost local LLM reliability without touching the model itself

by Aleesha_hacker

5/19/2026 at 8:34:25 PM

Thank you! It was a really fun rabbit hole to fall into and I found a bunch of counterintuitive stuff.

I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior -focused fine tuning. But the harness matters almost more than the model in many cases.

by zambelli

5/19/2026 at 9:45:34 PM

Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.

One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.

Brief: https://statewright.ai/research

by azurewraith

5/19/2026 at 9:57:08 PM

Nice! I'm not surprised at your findings (anymore). Mechanical reliability is the key to small models, and it's a big unlock. I've seen the same thing you just described. And the agnostic nudges forge sends at inspired by exactly that. Just show the model how it failed, gracefully, and it'll likely figure a way out of it itself.

Forge doesn't have a SWE-specific eval, but I've built a custom coding harness (not public yet but maybe soon) built on forge and saw the same behavior you seem to have seen in agentic coding.

by zambelli

5/20/2026 at 5:49:10 AM

Reads like processes guarding mediocre teams into higher probability of success? I can hear Alanis Morissette in my head now somehow...

by jeffreygoesto

5/20/2026 at 5:55:30 AM

Basically, yeah...this uplifts everything I've tested, but it's the small models that benefit most. A perfect model would get no benefit.

Mostly, I'm embarrassed I've done this whole public reveal without any use of Alanis Morissette anywhere in the work :/

by zambelli

5/20/2026 at 2:52:37 PM

Very nice! I also saw you have a vllm branch and I validated it works on my system. There is bugfix which sent you a PR for to auto-discover served-model-name which vllm hard validates for.

by sfifs

5/21/2026 at 4:13:58 AM

Merged! Thanks for that catch. I'll try to sequence the in-flight work ASAP to get the vllm branch merged in as a whole.

by zambelli

5/20/2026 at 3:35:01 PM

Oh, awesome! I'll take a look.

by zambelli

5/20/2026 at 2:53:09 PM

Sounds like an implementation of the discussion[0] spawned by this[1] article. I've been thinking about the best way to implement such a system ever since seeing that. I'm going to try this out.

0. https://news.ycombinator.com/item?id=48051562

1. https://bsuh.bearblog.dev/agents-need-control-flow/

by alsetmusic

5/20/2026 at 3:34:21 PM

[dead]

by zambelli

5/19/2026 at 7:48:06 PM

What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?

by tommica

5/19/2026 at 7:51:38 PM

It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [json blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.

Big frontier models need this less than small models.

by zambelli

5/20/2026 at 4:49:54 AM

Nice explanation, thank you.

So basically the kind of thing I'd usually be doing manually with small models, over and over again, you just automate that nudging and off they go.

Sometimes LLMs have seemed to me like "computer programs with inertia" and in that frame what your tool does is identify and reduce friction at key points so the wheels can keep spinning.

by blurbleblurble

5/20/2026 at 5:22:51 AM

Yep! The big frontier models are already quite good at doing that, and they have decent harnesses. That's why Opus on Claude Code does what it does.

Small models aren't there yet and they would veer off course, this just nudges them back onto the road. Whether or not they have a good sense of direction is a different question.

by zambelli

5/20/2026 at 7:27:44 AM

Really nice intuition, thank you.

by blurbleblurble

5/20/2026 at 12:12:51 PM

When it comes to the business logic of production use, this particular failure type is less obvious compared to benchmarking tasks. Benchmarking involves having the answer already known — it helps detect mismatches easily. Business logic pipeline does not. If LLM gives out a valid output that happens to be semantically incorrect, the pipeline goes through. There is no mistake to catch.

Created a dedupe pipeline where an LLM decides whether two feature requests are similar enough to merge. Occasional mistakes in terms of false positives — valid JSON structure, but incorrectly assessed similarity. In this case, it didn’t help to implement the retry technique. The solution was implementing a deterministic gate validating the output of the model based on its semantic similarity score calculated separately.

The reason why recovery works only with the help of additional tools when the error rate is at zero percent becomes clear: the LLM does not recognize the fact that it made a mistake. The guardrail becomes necessary for that — the retry is just one way of implementing the guardrail concept.

by luodaint

5/20/2026 at 12:50:12 PM

Definitely, there's several failure modes and Forge doesn't address all of them. This is just one tool in the toolbox to getting things stable enough for production use at reduced costs.

Forge sits one level lower - in my mind - than a gate which would sit more at the workflow level. Perfectly complementary.

by zambelli

5/19/2026 at 9:58:17 PM

Maybe I am reading it wrong but I don't think this does what it claim it does or at least how it sounds.

Basically this is a tool auto-complete that has a workflow element to it with certain steps that need to happen in certain order. In other words the order is defined in advance. Am I correct?

Basically execute step 1 first, then step 2 and finally step 3 and this is the schema for each step. That is effectively the guardrail and there is retry logic.

If it is the case, this is obviously useful but in a very specific set of problems where the solution is kind of known in advance. A workflow automation might work but this is kind of N8N where each step is LLM step.

Anyway, I might me wrong but I wanted to share a few thoughts.

by _pdp_

5/19/2026 at 10:02:22 PM

Partially correct, but an important distinction to call out.

You don't have to define the workflow steps. You can just expose the set of tools to the model and let the LLM call whatever it wants in any order, and every guardrail except the prerequisite step enforcement is still there to help.

If your workflow does have step enforcement, that can also be conditional. For example like Claude code does read required before edit. You can define a conditional enforcement where the agent must have called read before edit, and even force the same file path. That doesn't mean the model has to call edit at all...

But maybe I could have been clearer in the docs on the workflow pieces.

by zambelli

5/19/2026 at 10:12:19 PM

The docs should start with that with a very clean explanation how it works. Basically first paragraph. :)

Otherwise you should expect churn.

But also it should really go into some detail how is this different from tool calls with type enforcement on expected parameters.

by _pdp_

5/19/2026 at 10:15:27 PM

That's good feedback, thank you! I have an update landing shortly so I'll make sure to clarify in the docs! I appreciate it!

by zambelli

5/19/2026 at 10:04:22 PM

Funny timing. I’ve been building something adjacent, though from a different angle: not primarily local-model reliability, but a control layer around agent execution, tools, routing, and operator intent. I was calling these "synthetic models", but decided yesterday "LLM middleware" is a clearer description.

Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright

The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.

Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.

by bglusman

5/19/2026 at 10:07:40 PM

Conceptually I think definitely! Forge has no opinion on what the agent should be trying to do, that's the "middleware"'s job, so to speak.

Forge is just trying to make sure that when the model decides to do something, thee execution is reliable.

As for software integration, let me know if you run into any issues and I'll be happy to take a look or try to patch something!

Harnesses as first class infra all the way. I'll take a look at your work and see if I spot any obvious tensions.

by zambelli

5/20/2026 at 1:31:36 AM

I've just read through your readme and I have zero clue what this does. Something about proxying model calls and applying "policies" to them? But what kind of things does it actually do, what benefits are there? That should be at the top of the readme.

by esperent

5/20/2026 at 2:21:44 AM

I'm sorry to hear that! I'll take a fresh look at docs in my upcoming release.

In a nutshell, it applies guardrails around LLM calls to make them more reliable - specifically small models but works on all: "on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).".

It'll try to parse malformed tool calls, it'll automatically compact if needed, it'll enforce any workflow requirements you define (ie, read before edit) - and it does so with domain-agnostic guardrails. It catches and feeds errors back to the model in a structured way so the model self-corrects (hopefully).

Each guardrail can be removed as desired by a consumer. It can be used as a building block library (WorkflowRunner approach), it can be integrated into existing source (middleware), or it can be a drop-in addition to an exiting workflow (proxy mode).

by zambelli

5/20/2026 at 3:20:34 AM

I think that comment was aimed at my Wardwright link, not Forge, given mention of policies and proxying model calls! I think your docs are in much better shape ;-)

by bglusman

5/20/2026 at 3:36:35 AM

lol - my bad! but thanks!

by zambelli

5/20/2026 at 6:08:34 AM

Yes it was for wardright, sorry for the confusion. Your forge explanation is clear.

by esperent

5/20/2026 at 3:19:00 AM

[flagged]

by bglusman

5/19/2026 at 11:09:22 PM

Ironically, the project this idea emerged out of for me is also called Forge, actually Calciforge… https://calciforge.org / https://github.com/bglusman/calciforge

Name was just a portmanteau of Calcifer's forge, because Howl’s moving castle seemed like a good metaphor for what I was trying to do… I had synthetic models as apiece there but I realized a) it was out of place and b) it was my favorite feature there

by bglusman

5/19/2026 at 7:50:42 PM

So, this basically ensures that models call the right tools with the correct format?

by k__

5/19/2026 at 7:52:54 PM

In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.

by zambelli

5/19/2026 at 7:54:28 PM

Ah, so it a "smart" retry mechanism?

by k__

5/19/2026 at 7:57:25 PM

I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).

The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.

by zambelli

5/19/2026 at 9:12:43 PM

Maybe similar to Instructor [1] which was a cool tool for json and structured output enforcement combining pydandic with ai retry loops very handy for when models don't have that covered

[1] https://github.com/567-labs/instructor

by jimmySixDOF

5/19/2026 at 9:35:01 PM

Interesting! I'll look into that. Would mean another dep/integration but might be more robust.

by zambelli

5/20/2026 at 4:35:46 AM

Very cool work! Regarding your finding "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Couldn’t this be solved by designing better tool responses instead of adding another layer in between? Just curious and probing my understanding.

by Imanari

5/20/2026 at 5:21:03 AM

100%, a better tool would work or even remove the problem overall.

The isssue/use-case is more around, say, a database table or legacy systems where your tool is just hitting a legacy API that may or may not be good. A surface you don't control.

It didn't come up as a use-case in this eval honestly, it's more the concept of a standard, like 4xx vs 5xx. I just felt it was missing from the ecosystem overall.

by zambelli

5/19/2026 at 11:02:19 PM

Why this entire tool chain instead of building within something like pi code?

I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.

by tempoponet

5/19/2026 at 11:23:19 PM

Mainly because I have plenty of use cases and not all of them need or want pi. Forge isn't an orchestration framework and is not coding specific, it lives one level lower - if I understand pi correctly.

The proxy mode should integrate seamlessly, and the middleware guardrail mode could be lifted into pi.

As for little coder, I love it! I wanted forge to be more generic than just agentic coding as there's many more agentic workflows worth optimizing with small models.

by zambelli

5/20/2026 at 10:15:45 PM

Thank you for the thoughtful reply. I have some smaller 3080's I'm looking to place and this sounds like a good opportunity.

by tempoponet

5/20/2026 at 4:30:13 PM

This seems similar to what I done using llama.cpp's "Grammar constrained generation" for my local agents. But using that instead of catching and retrying it is just literally impossible for the LLM to generate something that doesn't match a specific schema of tool choices. It is amazing how much better small models can be when you reduce the problem space to only grammatically correct answers.

by peer0

5/20/2026 at 4:55:00 PM

Interesting, catching the problem upstream, effectively. How did you enforce the grammar?

by zambelli

5/20/2026 at 5:43:04 PM

https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

llama.cpp supports grammar limiting using either GBNF or json schema (It just translate it to GBNF behind the scenes I think). So I have my harness generate a tool schema on the fly (based on what tools are possible for the current task) and pass it in at request time.

by peer0

5/20/2026 at 5:59:29 PM

Oh, interesting - thanks for the link. I really haven't explored this but it should slot in fairly easily I think? Gotta dig into it more.

by zambelli

5/20/2026 at 5:51:10 AM

I think im missing something, don't all harnesses (opencode, pi, etc) already do stuff like "retry"? As far as I can see, when a tool call fails in either, the model gets the error back to correct.

by DavyJone

5/20/2026 at 6:09:22 AM

Yes and no.

Harnesses do have retry mechanisms. In opencode in particular, I think they return the error as-is to the model in the next turn. But that's slightly different. Harness retries come mostly in two flavors:

1) provider-layer: HTTP requests to cloud retries, with or without exponential backoff. It covers you for transient network hiccups or rate limits, and a big Opus model really doesn't need more than that.

2) sort of a hope-and-pray retry. Tool ran, returned an error string of some kind, gets fed into model as-is, and the model is expected to read the error message and self-correct with no guidance. This is fine for frontier, and even some of the large oss models. They have the context-following capabilities needed. For smaller models, this won't be enough, not reliably over many turns.

- if model outputs malformed json, provider will reject it before it even reaches the tool, the error loop is broken. A rescue parser handles that - can be ~5-15% of calls on a small model sometimes.

- model calls the wrong tool, correctly, then proceeds confidently with context that won't help it. step enforcement can help here.

- model terminates prematurely, thinking it's done. prerequisite enforcement can help here (say, forcing the model to call pytest before declaring the feature built).

- Escalating nudge messages, that specifically nudge. Just returning error messages doesn't tell the model what to do, it just tells it it was wrong. A message that spells out "tool X does not exist, call one of the available tools: A, B, C" is more helpful to a small model than "error: X not found".

So, in short - yes, retries exist in harnesses, but rely on top-tier model interpretation of the error messages. When working with top models, there's likely no real difference, or a minor one (see Opus bare vs Opus reforged). But Forge provides a more hardened suite of guardrails that are effectively necessary for small models.

by zambelli

5/20/2026 at 9:37:54 AM

Nice work! I've worked through all kinds of local models, very extensively for a week on an NVidia Spark. Gemma and Qwen, quantized, somewhat shine but the results overall compared to say a Claude Haiku were so disappointing (in context of tool calling) that I ended up returning the hardware. I'm curious how the same local models and benchmarks I have will hold up, will try this.

by philipp-gayret

5/20/2026 at 1:03:51 PM

Good luck! Frontier models are called frontier for a reason. I've seen Forge get local models close to frontier on these evals, even beat it in some cases, but frontier still has an edge overall - no denying it.

The key I think is to look at what use cases you have that aren't big monsters. Auditing logs, home assistant, reading and summarizing news rss feeds, etc...stuff that's fairly bite-sized per task, but high volume. Then the local models make sense and they just need mechanical reliability to close the gap.

by zambelli

5/20/2026 at 6:19:51 AM

It’s really strange to see a project I really fundamentally agree from the same author of a project I fundamentally disagree with. How is it that you want to remove watermarks from AI generated images while also making AI more a more reliable partner? I am not trying to be combative or accusatory, just am curious about your world view an open to an argument that removing the origin of AI generated images isn’t an existentially dangerous act.

by digitaltrees

5/20/2026 at 6:27:11 AM

I think you have me confused with someone else. I haven't worked on any AI watermark removal project.

What project are you referring to specifically?

by zambelli

5/20/2026 at 6:57:24 AM

You’re right. I followed another hacker news thread to the git repo of the watermark removal project and saw your name and seem to have wrongly connected you as an author of both projects.

Seriously awesome concept to what you did build I will test it out.

If you’re interested, I have sponsored research on AI reliability with Duke University (my graduate Alma mater) and there is an active research project this might be a good fit for if your interested in participating.

by digitaltrees

5/20/2026 at 7:01:17 AM

Oh neat! I'm a little slammed the next couple of weeks with an existing engagement + CAIS, but happy to connect and see if timelines work out?

by zambelli

5/20/2026 at 7:02:58 AM

https://calendly.com/ryanwmartin/open-office-hours

No matter what, keep up the good work. ;)

by digitaltrees

5/20/2026 at 1:04:14 AM

This is a neat project, but the description made me realize that I don't actually know what the term "guardrails" means.

... which lead me to realize that it's one of those terms with multiple meanings - like "agent" or even "AI" itself - but where people who use it may not be aware of how many different definitions are floating around.

In this project it refers to validating tool calls - fixing invalid tool responses, making sure certain required tool calls have been made, maintaining an error budget after which the task is abandoned with an error.

Other projects might use "guardrails" to mean protecting against unsafe content (Llama Gaurd), refusing off-topic queries (NVIDIA NeMo Guardrails "topical rails", filtering PII, detecting jailbreaks, or human-in-the-loop checks of specific actions.

I've even seen people talk about running a coding agent in a sandbox (Docker, Firecracker etc) as a form of guardrail.

by simonw

5/20/2026 at 3:47:48 PM

Yes, "guardrails" is a squishy term. But it gets clearer if you ask what transition is being guarded.

Some of this is inside the model, like topic refusals. Forge sits at the tool call level.

My personal workflow uses guardrails at the SDLC level: I have a standard pipeline (plan, design, code, build, test). I use gates between each stage, and the right composition leads to a much higher quality in the final product.

Also worth mentioning that gate failures are given to the agent that produced the artifact, so it has a chance to fix it. That means that I don't have to review obviously wrong output.

by mrothroc

5/20/2026 at 7:04:04 PM

Nice symmetry with tool call failures being sent to LLM that made the call without bugging the user. The artifact-generating entity gets the error back, effectively.

100% correct, and stackable. Could have topic refusal in LLM training itself, forge in tool call alter, and sdlc gates at the workflow level.

by zambelli

5/21/2026 at 1:45:49 PM

Definitely stacks. The thing that made it clear for me was being explicit about the stages, and where/what you can verify with a guardrail, or gate. I wrote up the framework I use here: https://michael.roth.rocks/research/trust-topology/

Being explicit about the space between the stages is critical, because that's your enforcement point.

by mrothroc

5/21/2026 at 6:49:30 PM

This is a really neat writeup, and the empirical data for coding agents is super useful. Will take a closer read and see if there's anything I easily lift into my harness!

by zambelli

5/20/2026 at 1:07:59 AM

That's a fair point, and frankly something that might not age well in my docs one day. I genuinely don't know what the industry will standardize on when it comes to the use of the term "guardrails". I've seen the sec definitions as well.

You're 100% right about how I meant it and what it means within Forge though, but it's something that might lead to doc changes as things evolve.

by zambelli

5/20/2026 at 3:49:42 AM

I'm thinking of it like a guardrail that keeps your car from driving off the edge of a road, but in this case, it keeps your tool calls from driving off a cliff.

by trollbridge

5/20/2026 at 1:14:18 PM

[flagged]

by Charles389no

5/19/2026 at 9:42:48 PM

> # External mode — you manage llama-server, forge proxies it

> python -m forge.proxy --backend-url http://localhost:8080 --port 8081

This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.

In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)

by nzeid

5/19/2026 at 9:49:29 PM

Yes that's correct !

/v1/chat/completions is the entry point.

In proxy mode, here's what forge applies on each request (handler.py builds these):

Response validation: ResponseValidator(tool_names) checks each tool call against the declared tools array. If the model emits a call to a name not in tools[], or a malformed call shape, it's caught before the response goes back.

Rescue parsing: When the model emits tool calls in the wrong format — JSON in a code fence, [TOOL_CALLS]name{args} (Mistral), <tool_call>...</tool_call> (Qwen XML) — rescue parsers extract the structured call and re-emit it in the canonical OpenAI tool_calls schema. This is the biggest practical lift, especially on Mistral-family models that ignore native FC and emit their own bracket syntax.

Retry loop with error tracking: ErrorTracker(max_retries=N) — if validation fails, forge retries inference up to N times with a corrective tool-result message on the canonical channel, rather than returning a malformed response to your caller. From your perspective the proxy looks like a single request that just took a few extra ms.

What proxy mode does NOT do (because it's single-shot, not multi-turn): prerequisite/step enforcement (those need a workflow definition spanning turns), context compaction, session memory. For that surface you wrap the WorkflowRunner class in Python — proxy mode trades that depth for "use forge with your existing setup, no Python rewrite."

So yes — the proxy is fortifying the response shape and retry behavior of /v1/chat/completions. The full agentic guardrails are at the Python class level above it.

For greenfield projects, I've been building on forge native using WorkflowRunner so I get all guardrails. But obviously as a drop-in replacement in existing systems then proxy is the way to go.

by zambelli

5/19/2026 at 9:53:10 PM

the funniest thing I see in opencode with tool calling is the model calls 10.0 and opencode says it's an error because the spec is an integer, even though it's obvious to anyone that if a float can be coerced properly to a integer, then that should be a success.

by cyanydeez

5/19/2026 at 9:58:41 PM

Yeah it's a delicate balance between precise and silly, and too permissive.

I'm definitely still iterating on forge, but so far sending the model a friendly and gracefully handled error message works wonders (instead of barfing a stack trace or something).

by zambelli

5/20/2026 at 7:53:35 AM

Really cool direction. For folks thinking about the “agent safety” stack more broadly, this feels complementary to things like Kontext’s kontext-cli (github.com/kontext-dev/kontext-cli) and OneCLI (github.com/onecli/onecli)

by mc-serious

5/20/2026 at 1:15:46 PM

Yeah I would think so!

A lot of current tooling is layered mostly at the workflow level. Auth for the agent, or memory management for the agent (like some smart skills stuff), but Forge sits below that.

In most cases I've looked at, it could be slotted in with other work without much disruption. Forge just increases mechanical reliability of tool-calling, it shouldn't disrupt your workflow-level layers much.

by zambelli

5/19/2026 at 10:07:04 PM

This seems pretty awesome; being able to use an 8B model for tool calling would be perfect.

Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?

How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.

by jamesponddotco

5/19/2026 at 10:10:58 PM

I have an open GitHub issue for macOS hardware detection. I don't have a Mac myself to do dev on but happy to accept a fork! I did assign a buddy to that issue but she's been slacking - call her out :p.

Latency is dependent on the guardrails firing, effectively. If nothing fires, it's a passthrough, for all intents and purposes, very little overhead. But if a retry nudge fires then that's another LLM call.

As a consumer for a home assistant, a retry nudge firing is something I'd catch, and have my voice model output a pre-baked "one sec, trying again" sort of filler message or something.

by zambelli

5/19/2026 at 7:02:18 PM

Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.

by zambelli

5/19/2026 at 9:02:57 PM

super interesting work. It will take me a few days to dig in and really understand it. But I'm looking forward to it.

I run small models at home, so I'm very curious.

by schaefer

5/19/2026 at 9:33:09 PM

That's awesome! Let me know if quick start is causing issues or anything else you'd like to dig into.

Out of curiosity, what models are you running?

by zambelli

5/19/2026 at 7:43:15 PM

dashboard link is dead

by fabian_shipamax

5/19/2026 at 7:48:07 PM

Does this work? https://github.com/antoinezambelli/forge/tree/main/docs/resu...

by zambelli

5/19/2026 at 8:08:35 PM

yes, that link works for me.

by schaefer

5/21/2026 at 2:45:46 AM

one week in and already got a custom gui frontend going, that's fast

by ElenaDaibunny

5/20/2026 at 6:19:15 AM

So, I experimented a little bit with smaller models and the problem I faced is that it would simply not call a tool that is available, but instead just describe the tool. Is this something that Forge can help with?

by _fizz_buzz_

5/20/2026 at 6:38:49 AM

Within limits, yes. Forge has escalating nudges that will tell the model effectively "stop responding with text, you MUST call a tool" vibes. If the model is emitting something like "ok, let me call the tool: [valid json tool call in the middle of prose]" then we catch it with rescue parsing.

But at the end of the day, if the model keeps responding with text, there's nothing forge can do. I've run into that failure mode for sure, even with forge.

That works well enough for all the models shown in the eval here: relatively modern 8B+ models.

But some of the older generation (mistral 7b, that sort of thing) still can't be reliably used in something like a production setting.

by zambelli

5/20/2026 at 10:42:46 AM

sorry if it's a stupid question, but isn't generating valid json tool call in the middle of prose the way tool calling works? what is that missing?

by Dansvidania

5/20/2026 at 1:23:45 PM

Not stupid at all!

Some of the older models did do this (like 3.5-era ish I think), and the harness would parse the results.

The newer way frontier has setup is structured tool calls. `tool_use` or `tool_calls`. The response is then received as a different tool_result rather than a regular message. That's a bit of the newer way of doing it.

The failure mode in question is more the model mixing the two: "Sure, I'll read the file: {"tool": "read", "args": {"path": "foo"}}" - that'll break stuff. Other failure modes are the json not parsing when sent it as a structured call, and in some cases the model just emitting text and forgetting the tool call.

by zambelli

5/19/2026 at 9:35:01 PM

How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?

[0]: https://github.com/dottxt-ai/outlines

by lucrbvi

5/19/2026 at 9:43:43 PM

I only just skimmed it, but will try to dive deeper in a bit.

I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.

I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.

Once I dig into it more I'll try to highlight other deltas.

by zambelli

5/20/2026 at 5:23:26 PM

I was hoping it would work with vLLM (openai compatible) to test it, does anyone know a similar proxy for local coding models?

by somethingsome

5/20/2026 at 8:07:37 PM

Check this: https://github.com/antoinezambelli/forge/tree/az/vllm

by DeathArrow

5/20/2026 at 8:16:05 PM

Yeah I got it working as a quick test run to confirm a model issue vs backend issue on a consumer app. It worked on my dual-5070 Ti rig, but I didn't have time to formalize all the way and merge it in. Thanks for linking it!

by zambelli

5/20/2026 at 9:33:10 PM

Thanks, I just tried, for me it worked on 2x L40S with vLLM. I had some issues due to the model name, forge was forwarding 'default' instead of the real model name 'Qwen2.5-Coder-14B-Instruct'.

If someone else struggle on this step, I added in vLLM args: --served-model-name "Qwen2.5-Coder-14B-Instruct" --served-model-name "default"

So default becomes an alias.

I didn't yet test Forge, I was just happy that it worked at the moment ;)

by somethingsome

5/20/2026 at 10:05:37 PM

Oh that's a good find, I'll book ark this for a GitHub issue.

Glad to hear it's working!

by zambelli

5/19/2026 at 8:15:42 PM

Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?

by dpweb

5/19/2026 at 8:21:10 PM

Very cool! I would look at the tokens returned by each of the calls. You can map those to API costs per input/output tokens. Forge should be capturing those (or can, as passthrough from llama.cpp).

At least, if I understand your economic benefit angle correctly.

For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".

by zambelli

5/19/2026 at 8:30:26 PM

Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).

by mholubowski

5/19/2026 at 8:32:23 PM

Neat! Historically I've been most active on LinkedIn but the AI community seems very X-leaning so I'll make sure to pay closer attention there. Good luck with the harness, happy to connect!

by zambelli

5/20/2026 at 6:46:01 AM

I was wondering that will modifying prompts or contracting the context also impact the performance? It may mistake the original meaning, and these steps also need help from external LLM.

by momo26

5/20/2026 at 6:51:59 AM

Forge doesn't modify the prompt, it just injects information into the conversation as if it was a conversation turn. Over many turns - it can degrade the model (a concept I'm calling "effective attention"). But that requires serious context growth that really only becomes relevant for long-running agentic coding tasks in my experience. Still, it's possible.

Context compaction can also affect the outcome - I have eval scenarios for that as well but not in the published set, only in the repo. For those, I'd say "it's better than nothing". If you hit max context, the whole thing will barf or OOM the rig or something like that. So compaction degrades performance versus some theoretical ideal where you never need to, certainly. But it's better than a hard failure. Eval on those scenarios showed increasing degradation depending on severity of compaction. I view the auto-compaction as insurance. I never give the models tasks that will require that much context, but if it ends up getting there then the run might be saved.

by zambelli

5/20/2026 at 6:29:08 AM

Probably in same league https://github.com/Doorman11991/smallcode

by Jayakumark

5/20/2026 at 6:34:29 AM

I think there's certainly overlap there - and I love to see small local models being leveraged!

I do think there's some differences though. The biggest one being that forge isn't a coding harness, it's a guardrail primitive, really. Applicable to any tool-calling workflow.

As for the errors, are you nudging or passing errors back or swallowing them completely? Love the 2-stage routing though, neat!

by zambelli

5/19/2026 at 10:02:43 PM

Curious if this would help larger local models? Qwen 3.6 varieties of deepseek4?

by __mharrison__

5/19/2026 at 10:05:22 PM

Yes it does! I haven't published those evals yet, but I'm actually running 24-35B class models on a custom coding harness built on forge (even 120B class recently).

I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.

But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged doe better than bare for them as well :)

by zambelli

5/20/2026 at 1:00:08 AM

>But the results are the same. Reforged models do better than bare, even at those sizes

>I haven't published those evals yet

Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.

by kgeist

5/20/2026 at 2:29:27 AM

Good call! The latest forge version has per-model-parameter configs sourced from official sources (can be overridden), that's what I'll use for evals and each eval set will be paired with a commit hash. But I'll make sure to call out the location of the params and maybe highlight some for the popular models.

For the paper - more academic in nature - I wanted to isolate the model performance variable from guardrail lift. The delta is what mattered more than final score. For the paper, everyone got temp=0.7 - that was intentional.

As for Qwen3.6, it's really solid. It'll do really well on forge I can call that now. When I pushed it into agentic coding specifically and the eval suite I use there (separate from forge), even it needed help on long-running tasks - but it's definitely a top model right now.

However, entirely possible there are better settings than the "official recommendations" I found - which would be a neat finding in itself.

by zambelli

5/19/2026 at 10:19:10 PM

If it's worth it to you, you could try running it on Deepseek v4 flash which is very cheap right now...

by happycube

5/19/2026 at 10:33:50 PM

Exactly what I was thinking - even on frontier or near-frontier models I still see my agents get stuck in these pointless loops where it's very obvious to me what they need to do to get "unstuck".

by trollbridge

5/19/2026 at 11:54:02 PM

Yeah, it's a useful framework even with frontier. And it definitely lifts "cheap" frontier models like Haiku into more solid territory. I haven't done a ton of forge integrations into frontier (like pointing claude code into proxy mode) yet, but if you run into any issues let me know!

by zambelli

5/20/2026 at 3:48:28 AM

And we're off! It's working great with DeepSeek V4, although DeepSeek V4 Pro tends not to really run into problems anyway being near-frontier, but I definitely see improvement with Flash.

by trollbridge

5/20/2026 at 6:09:07 AM

Hi! I'm using DeepSeek V4 Flash on high via opencode.

It should work with opencode using the proxy server or middleware method right? Any tips?

Does this need a GPU to work? Or is it CPU only? I ask because I plan to try to run this using Docker. But I have a modest RTX 5070 12GB VRAM.

Or maybe I could use opencode as a remote backend too?

I'm thinking of trying the OpenAI-compatible provider route: https://opencode.ai/docs/providers/#custom-provider

by bel8

5/20/2026 at 3:56:13 AM

That was fast! It's great to hear it's working well :)

Did you notice any particular guardrails firing? Always curious about things I haven't tested on - especially if it has a different shape.

by zambelli

5/20/2026 at 2:18:51 AM

I'm attempting to make a replica of your Anthropic method that will do the same for DeepSeek. I'll let you know how it goes.

For our local Qwen, your setup works great out of the box!

by trollbridge

5/20/2026 at 5:17:32 AM

Deja vu from the other week https://news.ycombinator.com/item?id=48051562

by blurbleblurble

5/20/2026 at 5:53:12 AM

I think I'm aligned with the idea that some parts of some workflows are mandatory - auth, read before edit, etc.

But otherwise, forge really doesn't own or opine much of the workflow. Step enforcement exists if you want it, so do prerequisites, but the idea is that those could be conditional or optional (you may never need to edit a file).

The guardrails are designed to work for non deterministic flows or deterministic ones. In the latter, you just might not have one of the guardrails active. It's much more about nudging the model back on track than laying more obvious tracks, in a sense.

Overall, agentic reliability is definitely an active field.

by zambelli

5/20/2026 at 7:16:52 AM

In this blog post I'm reading their call for "control flow" as a generalization of exactly what your work illustrates so nicely.

The blog post doesn't say to me "we need to start encoding specifically opinionated conditional branching statements that guide the model" rather I'm hearing a call to realize the broader principles of control flow itself relevant for composing programs with LLMs.

I think your work "nudges" us in that direction.

by blurbleblurble

5/20/2026 at 1:25:50 PM

Nice ;). I'll take a closer read of it, that's on me - I am definitely seeing more people looking in this direction as agents start to ramp in production at the enterprise level, which I suspect is highlighting some of these failure modes at higher stakes. And also the cloud frontier API bills.

by zambelli

5/19/2026 at 8:04:48 PM

I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?

by xiaod

5/19/2026 at 8:16:20 PM

Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.

We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

by zambelli

5/20/2026 at 6:45:30 AM

Very unrelated: first time I see someone with the same last name as me in the tech community, it's somewhat odd :)

by valzam

5/20/2026 at 6:48:34 AM

The sound of it has something very gracious (or maybe it's the Italian vibe) :D

by rvnx

5/20/2026 at 10:57:04 AM

Nice work, thank you very much! Did you develop that as part of your work at TI or on your own time?

by raxxorraxor

5/20/2026 at 1:12:03 PM

Thanks! No this was my own time, just evenings and weekends - life-permitting.

by zambelli

5/19/2026 at 11:17:08 PM

I've been working on the same thing and even nearly called it forge. Instead I called it hammer.

I'll be keen to look through the code on this!

by tim-projects

5/19/2026 at 11:25:27 PM

Oh no! I have code-hammer coming out soon :D. Everyone is building stuff these days :p.

Always happy to see folks looking into small local models!

by zambelli

5/20/2026 at 7:52:22 AM

I went to view the dashboard and it is getting a github 404 error, just thought you should know.

by ApolloRising

5/20/2026 at 1:24:23 PM

I know :( - I posted the wrong link and now it's there forever.

Dashboard is in here: https://github.com/antoinezambelli/forge/tree/main/docs/resu...

by zambelli

5/20/2026 at 2:59:24 AM

guardrails this well-designed matter way more than just throwing bigger models at agent tasks tbh

by ElenaDaibunny

5/20/2026 at 3:03:53 AM

Thank you! I completely agree - especially for always-on systems like agents crawling databases or doing audits and the like. The sheer volume of calls will be enormous and being able to run it on simple hardware with a small model that fits instantly changes the economics of it.

Plus it's cool to see a little 8B model writing code :)

by zambelli

5/19/2026 at 9:19:28 PM

Hi Antoine!

Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?

by rebekkamikkoa

5/19/2026 at 9:34:12 PM

Hi! Yes, I definitely think so. I've seen variance across all model families I looked at. The magnitude changes, but the presence of variance is a constant.

by zambelli

5/20/2026 at 2:25:50 AM

Would putting this between a small model and an agent like Hermes improve performance?

by roger_

5/20/2026 at 2:36:30 AM

I haven't specifically tested this with Hermes, but I would expect so. Hermes is orchestrating things - it decides it needs to...whatever you want, book a trip for you. Forge will help make sure that the API calls to hotel booking sites parse correctly or gracefully retry.

Without forge, I'd guess a small model used for Hermes would have to retry entire workflows when an uncaught exception triggerd when it tried to reply with text when "calling a tool" ("Here is the tool call: [json blob]"). The issue there becomes partial successes can lead to state changes that need to be addressed (it booked the flight already, home it doesn't double-book).

Forge won't help with model reasoning quality though. If it the model thinks the right thing to do is to book 3 buses for your trip, forge doesn't care, it'll just make sure those api calls land.

by zambelli

5/20/2026 at 12:44:19 AM

Do you think a similar approach would work with smaller models, like 1.5B models?

by pianopatrick

5/20/2026 at 12:48:36 AM

I would expect so! I'm currently running Gemma 4 E4B evals and it's behaving the same. Better with guardrails. There might be a floor where any error nudge confuses the model more than helps, but I haven't found it across many 8B families and now Gemma 4 E4B.

by zambelli

5/20/2026 at 2:05:34 AM

have you considered implementing the addition of a leading canary sentinel that fires at the earliest/cheapest possible point instead of only on lag of some actual load-bearing constraint violation?

by MWil

5/20/2026 at 2:12:53 AM

Do you mean catching errors as tokens stream back versus waiting for the full message? If so, then no I hadn't looked into that. This was mostly geared towards local models so token cost isn't really a big deal, though latency might be.

And if you didn't mean that then please elaborate :)

by zambelli

5/20/2026 at 11:48:38 AM

No, more like not waiting for drift/deviation to hit something load bearing or god forbid go on hitting unnoticed over time. Let it hit something trivial that is constantly being monitored cheaply.

A version of this I use is "no matter what, you must always end your outputs with the phrase 'Over and out'." Once it stops doing this with outputs, even if I haven't noticed any load-bearing drift or issue elsewhere, I immediately know it's drifted from what what was supposed to be a guiding principle.

Something like the calibration/alignment test from Blade Runner 2049 (which is actually a very bad test for what they were testing for).

by MWil

5/20/2026 at 1:28:05 PM

Ohhhh, that's much more interesting. I haven't looked into that at all, but now I'm curious. I'd need to think way more about how to layer that into forge, but the principle could likely be applied somewhere. I get it now.

by zambelli

5/20/2026 at 6:14:43 AM

Thank you! I am not a researcher, I am a software engineer and I have been chasing better harness for quite some time now.

I firmly believe that we can bring down the costs for much of our productivity needs by a huge factor if there are guardrails. This is how I am building my coding agent: https://github.com/brainless/nocodo

There is so much we can do if we create tools that do more heavy lifting. Your example of ToolResolutionError is something I have not thought of. Again, I am coming at this from software engineering background, I still do not understand much of the inner working of models or their inference layer but I am sure I will slowly create a coding agent that performs really well for majority of people/business use cases (not enterprise) with small models and big harness.

by brainless

5/20/2026 at 6:22:24 AM

Fun! I've been looking at autonomous engineering as well - completely agree that tools and guardrails are the key.

ToolResolutionError is really inspired by HTTP 4xx vs 5xx codes. I don't even have a super clean abstraction I'm happy with yet, I just noticed a lack of standard in the industry (that I was aware of) so I thought to surface it as a gap. I'm sure there's a better shape than my current ToolResolutionError but it's a start!

by zambelli

5/20/2026 at 12:17:23 AM

That's a huge gap for llama.cpp server - any idea why?

by GrinningFool

5/20/2026 at 12:47:09 AM

Best guess is it's native mode. The function calling template is just broken for Nemo.

I did go with an extreme example in the post (but true). Other deltas are smaller but still statistically significant. 30 pt swing between llamserver prompt vs ollama, 4-5pt swing between llamafile and llamaserver prompt.

by zambelli

5/20/2026 at 2:29:08 AM

The dashboard github link appears to be broken

by Topology1

5/20/2026 at 2:32:40 AM

Yeah I'm sorry about that - I thought that link would work. Here is the fixed one (dashboard inside): https://github.com/antoinezambelli/forge/tree/main/docs/resu...

by zambelli

5/20/2026 at 12:18:06 PM

how do you know your harness design isn’t just overfitting on your test set?

by gcr

5/20/2026 at 1:11:26 PM

Love this question! A few points:

- First, there's totally a "risk" there. I built both the harnesses and the eval suite and that's hardly a double-blind study. There's no world where some bias doesn't leak through.

- I did try to design the guardrails to be domain-agnostic so they aren't tuned to specific scenario failures and return generic nudges to the LLM.

- Most tactically, the guardrails were built on the first 18 scenarios (OG-18) published in the paper, and only after did I had 8 more advanced reasoning ones. I didn't update the guardrails when I added those, and the lift was still there. If they were overtuned, they wouldn't have the same level of impact on an newer set.

- I did dogfood forge post publication using several unrelated consumers and the features I baked in were rarely guardrail related. If they were, it was more model focused (ie, xml-parse-rescue for granite models).

But at the end of the day, there's an explicit connection between the guardrail author and eval author. Happy to take contributions of eval scenarios if you want to stress test things, or hear about your experience running a completely different consumer!

by zambelli

5/20/2026 at 5:00:37 AM

I'm curious if in proxy mode it works also with remote models or only with local models.

Also, did someone tried it with local Qwen 3.6?

by DeathArrow

5/20/2026 at 5:26:39 AM

I believe there's a comment below mentioning "qwen" but not a specific version number - if you're looking for 3rd party validation. I've personally tried qwen3.6-35b-a3b, qwen3.5-35b-a3b, and qwen3.5-27b with forge (agentic coding harness built on forge workflowrunner) and it works great. Official forge eval benchmarks for that class of models is still a couple of weeks out.

Proxy mode should work fine with remote models, the only constraint is the compatible endpoint - which is standard anyways. I don't think you'd have any issue hitting either a remote gateway like liteLLM or just claude API.

by zambelli

5/20/2026 at 7:08:25 PM

Thank you for taking the time to reply to so many questions. I am really excited about this and for me and my usage seems to be one of the most important breakthroughs in AI in the last year because it makes the models better.

It would be nice if you can continue working and improving the tool and I hope other people will jump to help.

by DeathArrow

5/20/2026 at 9:48:21 AM

If its so good at coding, why did you use Python aka py-toy? Let me guess…

by pancsta

5/20/2026 at 1:26:48 PM

This is not an agentic coding harness. It's a generic tool-calling guardrail stack. I have built a coding harness built on Forge since, but that's not what this is.

by zambelli

5/20/2026 at 12:17:13 AM

Interesting!

The https://swival.dev harness already has retry nudges, step enforcement, error recovery, context awareness, etc. to try to support small models as much as possible.

Curious to see how it compares with forge, and if both could be combined.

by jedisct1

5/20/2026 at 12:57:04 AM

Oh interesting - I hadn't come across that!

I'd assume they could be combined. A coding harness would own the agentic workflow by nature, forge guardrails would help tool calling.

I haven't given it a thorough read yet but I think their guardrails might be more focused on the workflow level. They are doing error capture at tool level with warnings to the model, but I'd need to dig deeper. On the surface definitely the same design philosophy! Maybe Forge makes error nudges more of a first-class citizen?

Our compaction strategies might be the most similar of all the pieces. Cool find!

by zambelli

5/20/2026 at 9:54:06 AM

How does swival.dev compare to a diy agent harness like pi.dev or do they serve different purposes, since swival ships with the "extensions" by default?

by hydra-f

5/20/2026 at 1:50:48 AM

no different from how the mcdonalds system can turn any random person on the street to a smiling cog in the machine.

by choonway

5/20/2026 at 4:32:41 AM

impressive, we can get high tokens/s with 8B param models and doubling it with MTP

by yieldcrv

5/20/2026 at 5:31:20 AM

Yeah, throughput on small models can get really fun :). As for MTP, should work fine since forge just sits between model and consumer. As long as MTP didn't change the model endpoint contract (ie, you call llama.cpp the same way you would normally) then it should work out of the box. But I haven't tested MTP myself yet (or that commit of llama.cpp).

by zambelli

5/21/2026 at 2:42:12 PM

[flagged]

by ethanlearns

5/20/2026 at 3:29:53 PM

[flagged]

by ShellYard

5/20/2026 at 8:25:37 PM

[flagged]

by caspianmagnus

5/20/2026 at 7:20:54 AM

[flagged]

by coder0x

5/21/2026 at 12:01:09 PM

[dead]

by imlt

5/20/2026 at 5:30:50 PM

[flagged]

by slopymoe

5/20/2026 at 5:49:51 AM

[flagged]

by max_unbearable

5/20/2026 at 1:58:32 AM

[flagged]

by Bret_McKinney

5/20/2026 at 11:28:40 AM

[flagged]

by CroviaTrust

5/20/2026 at 4:00:36 PM

[flagged]

by maxothex

5/20/2026 at 3:20:24 PM

[flagged]

by ryanshrott

5/20/2026 at 10:52:20 AM

[flagged]

by hiayygo

5/20/2026 at 4:06:29 PM

[dead]

by tonyspiro

5/20/2026 at 7:58:06 AM

[dead]

by mlpicker

5/20/2026 at 7:05:58 AM

[flagged]

by xiaosong001

5/20/2026 at 12:24:56 PM

[flagged]

by Onplana

5/19/2026 at 11:26:27 PM

[flagged]

by jonnyasmar

5/19/2026 at 11:33:28 PM

That's where frontier pulls ahead for sure, at least on the big frontier models - though I haven't formalized those findings because...time.

Necessary disclaimer, forge isn't concerned, technically, with model quality, just execution of tool calls. Now for the actual answer...

What I found to be the limiting factor with small models in the 14B range was "effective attention". Beyond a certain point, still well within their training context window size, I start to see degradation. I don't have hard numbers for it, but that's where an Opus and the like can just keep going for ages. I did come up with a tool call message history collapse that I might dogfood into forge one day (effectively clean up the message history intelligently so the model doesn't lose track as easily).

That being said, my coding eval suite for my agentic coding harness does have some refactor tasks and feature additions (everything is done on an actual sandboxed repo) and the small models can knock out those tasks even while pushing the 50-60 tool call mark. But I wouldn't trust them to do more than 1 of those in the same session.

by zambelli

5/19/2026 at 11:44:53 PM

The "effective attention" framing nails what I keep noticing too. Sonnet's official context is huge in principle, but in a real coding session where the agent is reading 30+ files, running grep, processing test output, emitting diffs — somewhere around 60-80k effective tokens I can feel it start to "skim" earlier context rather than reason over it. The thing it forgot isn't out of window; it's just not weighted highly enough anymore.

The tool-call history collapse is a problem I'd pay real money to have solved cleanly. My crude manual version: keep the function calls but drop or summarize the responses for anything older than ~15 turns. Most of the "what was I doing" signal lives in the calls, not the outputs. Letting the model itself mark "I'm done with that thread, compress the responses" feels like the right abstraction, but I haven't seen anyone ship it well yet.

A per-model "compaction aggressiveness" knob in Forge could be interesting — the small-model effective-attention cliff might respond to earlier/heavier trimming.

by jonnyasmar

5/20/2026 at 12:25:31 AM

>The tool-call history collapse is a problem I'd pay real money to have solved cleanly.

It's general attention collapse and it happens everywhere once you start noticing it.

The simplest example, which even frontier models fail at, is something of the form `A and not B', which they keep insisting means `A and B' after the text gets pushed far enough back in the context.

The only solution, I think, that is even theoretically capable of fixing this is using a different form of attention. One which innately understands tree-like structures and binds tree nodes close together regardless of overall distance from the end of the stream.

Incidentally this is what I'm also working on at $job.

by noosphr

5/19/2026 at 11:49:44 PM

Forge does have tiered compaction, and it's configurable! Defaults are currently probably a bit on the high side for catching effective attention, but that might be a part of the code that interests you the most.

src/forge/context/ - specifically TieredCompact in strategies.py. That's the furthest I took it. The tool-call collapse in particular has been useful in agentic coding, but I haven't formalized/generalized it yet. I think within forge it'll be a callable tool that will rely on the model knowing when to trigger it (as you said - "I'm done with the task, can collapse"). That's the part I need to abstract out of my bespoke implementation.

by zambelli

5/20/2026 at 12:00:30 AM

At the moment TieredCompact is naive. It uses context thresholds the consumer determines and fires when those thresholds are hit. It just does different things at different threshold levels.

Your idea of using task shape to dynamically set those thresholds (or even move to model-triggered) I think is the key but is a trickier implementation. That's what I haven't gotten around to yet.

Definitely on my todo list but happy to check out a PR if you have something in mind.

Some additional info on my current public hack is also at: https://github.com/antoinezambelli/forge/blob/main/docs/USER...

by zambelli

5/20/2026 at 12:08:42 AM

Honestly probably not a PR from me right now — I'm in the middle of shipping something else — but the design idea I keep returning to is splitting the trigger into two signals:

1. Runtime-computed "context pressure" — tokens-since-last-compaction, depth of tool-call nesting, response/call ratio in recent turns. The runtime computes this; the model never sees it.

2. Model-emitted "natural breakpoint" — a tool call the model fires when it perceives it's done with a thread (file closed, task complete, branch abandoned).

Compaction fires on the AND of both. Keeps the model from compacting mid-reasoning-chain, and keeps the runtime from waiting until 90% context for the model to notice on its own.

by jonnyasmar

5/19/2026 at 11:54:45 PM

The "model triggers it" pattern is exactly the right shape, but there's a subtle failure mode in it: models are notoriously bad at perceiving their own context pressure. Asking "are you done with that thread?" lands well; asking "would compacting now help you?" doesn't, because the model lacks a reliable internal signal for "I'm starting to skim." You almost have to tie the compaction trigger to task-shape signals (file closed, test passed, agent reports a milestone hit) rather than self-assessment.

Going to actually go read TieredCompact tonight — curious whether you've ended up tying triggers to task signals or kept them on model self-report.

by jonnyasmar

5/20/2026 at 2:45:30 AM

That's a very insightful observation. How could you explain that using the analogy of a pancake breakfast?

by hedgehog

5/20/2026 at 12:49:45 AM

I almost said "it's jarring to see a human speaking fluent claude" but then I realized you're just a spambot.

by Retr0id

5/20/2026 at 1:04:42 AM

Generated comments are not allowed.

https://news.ycombinator.com/newsguidelines.html#generated https://news.ycombinator.com/item?id=47340079

by henry2023

5/20/2026 at 1:19:26 AM

Why do you think their comment is AI generated? I didn’t get that from it but I’m no expert.

by arijun

5/20/2026 at 11:33:06 AM

Stylistic tells: "The tool-call ambiguity point", "—", "the negative space", "The retry-nudge layer", "the right shape", "→", "context drift"

Correctness tells: find exits with 0 when no matches were found, not 1. LLMs do get confused about tool call results sometimes but it's nowhere near as bad as needing "[manual corrections] multiple times an hour".

Contextual tells: see their account history and other comments.

by Retr0id

5/20/2026 at 3:45:13 AM

The general tone (it just feels like it's an LLM) but also check the account history. It's a 2018 account that had never commented until today's flood of suspicious comments.

by fc417fc802

5/20/2026 at 1:34:36 AM

Maybe the m dash?

by klipt

5/20/2026 at 12:47:56 AM

AI slop

by jaboostin

5/19/2026 at 8:48:58 PM

I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?

by snovv_crash

5/19/2026 at 8:50:52 PM

I definitely use LLMs to help write things - but this is my draft!

Maybe I've been spending too much time reading the evals and I now sound like an LLM...

Either way, here I am - happy to answer any questions!

by zambelli

5/19/2026 at 9:23:32 PM

I guess it's that, and yes, much as they learned speech patterns from us, now we start to learn from them.

I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...

Best of luck with Forge!

by snovv_crash

5/19/2026 at 8:58:21 PM

If you are so outright against using AI, why would he care if you read his article about AI?

by throwaway20222

5/19/2026 at 9:17:03 PM

AI usage is great. The problem is the asymmetry in effort between generating text automatically, and then further amplifying this via posting it, while then expecting human eyeballs to spend the time reading it. It is antisocial.

If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Brian Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576

by snovv_crash

5/20/2026 at 3:44:48 AM

Thank you for mentioning it. Too bad you got downvoted to hell as usual when anybody dares to do it.

The original post and every comment by OP is so full of AI slop ("the biggest surprise!", "one thing I didn't expect!", "the biggest challenge!", etc. etc.") that is absolutely painful to read. I still can't believe most people (especially here on HN, I thought we were a bit better than this) can't notice all this stuff.

What's much worse, it's that all these people posting this useless slop are so dishonest ("I definitely use LLMs to help write things - but this is my draft!") that it makes me really nauseous... This is the worst time to be an internet user if you have more than 2 points of IQ.

by Karuma

5/20/2026 at 5:29:02 AM

I'm sorry you feel that way about my posts - hopefully you still find the work valuable. Still human here btw, and still 100% honest.

by zambelli

5/20/2026 at 11:36:08 AM

Just saying you’re not alone, very surprised by the reception given how brutally sloppified the OP is.

Interesting problems space but I hope the author just gives dot points next time rather than bloating it and losing most of its meaning.

by robkop