5/7/2026 at 8:36:37 PM
1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.
This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4, IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5; not as extensively, but they seem to have the same problems.
We ended up creating a super basic deterministic harness around the model: for each test case, trigger the model to test that test case, store the results in an array, write the results to a file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.
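The shape of that harness is tiny. A minimal sketch of the pattern (Python; the `run_agent` stub is a hypothetical stand-in for the real model call, not our actual implementation):

```python
import json
from pathlib import Path

def run_agent(requirement_text):
    # Stand-in for the real model call (an SDK call or a `claude -p` subprocess).
    # Returns a canned verdict here so the harness logic itself can be exercised.
    return {"passed": True, "notes": f"checked {len(requirement_text)} chars"}

def run_harness(req_dir, out_file):
    # Deterministic outer loop: the code enumerates the files, fixes the order,
    # and records results; the model only ever sees one requirement at a time.
    results = []
    for req in sorted(Path(req_dir).glob("*.md")):
        verdict = run_agent(req.read_text())
        results.append({"file": req.name, **verdict})
    Path(out_file).write_text(json.dumps(results, indent=2))
    return results
```

The point is that file enumeration, ordering, and bookkeeping live in code, so a missed or triple-tested file becomes impossible by construction.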
by 827a
5/7/2026 at 9:00:08 PM
I used to assume they pushed people into the prompt-only workflows because you're paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they're really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don't buy it. I do think it's going to increase productivity enough to disastrously affect the developer job market/pay scale, but I just don't think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don't), a bunch of investors would make them walk the plank.

I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
by DrewADesign
5/7/2026 at 9:43:32 PM
> However, I think what they're really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don't buy it.

I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break it up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
by cogman10
5/8/2026 at 3:01:28 AM
And then you realize that what you're using the smaller models for is ALSO decomposable, and part of it is just a few if statements. And then you realize that for this feature you don't actually need or want a model, because plain code is cheaper and better for you and your users in performance, reliability, and reproducibility.
by devin
5/8/2026 at 3:17:21 AM
So you have the model write the if statements and put itself out of a job.
by jimbokun
5/8/2026 at 2:10:40 PM
Alternatively, and sometimes more cost-efficiently: you can find a developer who can write bespoke if statements. There are dozens of us!
by gman2093
5/8/2026 at 6:28:28 PM
So, are we going to end up with a mechanical Turk that pretends it is an LLM but just farms out tasks to gig workers?
by saltcured
5/8/2026 at 8:31:29 PM
Additionally, developers tend to become less expensive as venture capitalists turn off the spigot, while access to giant frontier models becomes way more expensive. Beyond that, a developer might go out and have a beer with you after work, which appeals to the sickos that have the gall to prioritize humanity over fanatical efficiency for corporate gains.
by DrewADesign
5/8/2026 at 5:37:35 AM
Indeed, I've been experimenting with agent workflows for complicated tasks, where I essentially have a graph of agents with different roles/capabilities, including such things as breaking down complex tasks into simpler ones. There seems to be a point where a complex enough task is better performed by a group of cheaper agents/models than by one agent using one of the big SOTA models, in terms of both quality and cost.
by aleqs
5/8/2026 at 1:22:42 PM
The big SOTA models win in world knowledge; that's what all those parameters are for. But a huge fraction of agentic tasks is going to be plain clerical work that needs no special knowledge at all, and a much simpler model can do them in a straightforward way.
by zozbot234
5/8/2026 at 12:08:31 AM
It is also interesting because you get people with very different use cases arguing about the effectiveness of various models, but doing very different things with them.

It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and add a button connected to it on the front end, vs "Design me a YouTube". The smaller models can pretty dependably do the first and fall flat on the second.
by tempest_
5/8/2026 at 1:19:02 PM
> However, I think what they're really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects

You can have the AI design the custom harness in advance. It's not especially hard work! In fact, the AI could even come up with the workflow itself; it's a different and much simpler problem than trying to stick to it after the fact, with a filled-in context.
by zozbot234
5/8/2026 at 7:59:48 PM
That's what my system does. It uses a workflow if one already exists; if not, it just creates one on the fly from the primitives. https://github.com/notque/vexjoy-agent
I would prefer that be deterministic, though. This thread has me considering what, if anything, I can do to enforce it. Like, I could do it with hooks, but that's not elegant at all.
by AndyNemmity
5/8/2026 at 11:54:57 PM
yeah, that is what I am doing with the DAG-aware TUI hypervisor agents https://getspur.dev
by mf_kevintruong
5/7/2026 at 10:41:49 PM
The designing and implementing of a code harness in your workflow can be as simple as running something like /skill-builder.

You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, and for example use `claude -p` for the LLM call.
You can build this in 10 minutes.
I don't use a cloud platform, so I can't comment on that part. I'd say just run it on your own hardware; it's probably cheaper too.
by user34283
5/7/2026 at 9:06:09 PM
Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.
by pishpash
5/8/2026 at 4:27:18 AM
Secret: "compile" that orchestration prompt. Determinism is solved by turning prompts into code that can in turn run agents, or run code, or both.

Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
by fny
5/8/2026 at 1:38:08 PM
What about when the model trusts itself more than the "black box" you gave it, and hallucinates its use or non-use in favor of reimplementation? I found this video about "intelligent disobedience" interesting.
by Cyan488
5/8/2026 at 8:01:18 PM
Yeah, that's how I do skills. If I can make a script, I do. Everything that can be deterministic should be. https://github.com/notque/vexjoy-agent
by AndyNemmity
5/8/2026 at 12:13:10 PM
Exactly this; I tend to work this way. I built an ingestion pipeline to pull concepts out of a novel using Qwen and push them into falkordb this way.
by robinduckett
5/8/2026 at 9:28:21 AM
Could you elaborate on what "compiling an orchestration prompt" means?
by krzyk
5/8/2026 at 10:59:32 AM
When you get some abstraction working, you concretize it in something deterministic, or sort of "cache" that bit of knowledge (aka "write me a function, class, library, whatever"). In the future, the nondeterministic path has a deterministic piece to lean on as it explores the problem space. Rinse, repeat, and eventually you have a mostly deterministic system. Leave flexibility in the spaces where you need that nondeterminism.
by Frost1x
5/8/2026 at 11:20:38 AM
Rather than telling the LLM "loop through these files", tell it "write a script to loop through these files", then hard-code that script somewhere.
by LikesPwsh
5/8/2026 at 2:46:10 PM
The models will eventually be able to know, from natural language, that they need to do that to get the thing done.
by whattheheckheck
5/10/2026 at 9:54:56 PM
"The models will eventually..." Yeah, but they haven't, and it's been years now. Also, who cares? We have problems right now that need to be solved.
by suttontom
5/8/2026 at 6:02:12 PM
First we gave LLMs access to bash commands. Now we give them access to customized commands which they can reuse. It's the English language extending its claws into deterministic programming languages. Now can we please have backtracking and dynamic-programming-style thinking loops built into English, or into such orchestration prompts.
by renticulous
5/8/2026 at 10:34:53 AM
a guess, but i think they mean take the orchestration prompt and prompt yet another llm to turn that prompt into code..?
by throawayonthe
5/8/2026 at 12:03:47 AM
I saw a major uplift in performance after I combined tools like apply_patch with check_compilation & run_unit_tests. I still call the tool "apply_patch", but it now returns additional information about the build & tests if the patch succeeds. The agent went from an ~80% success rate to what seems to be deterministic (so far). I don't bother to describe the compilation and unit testing processes in my prompts anymore. All I need to do is return the results of these things after something triggers them to run as a dependency.

I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT 5.4 base model.
by bob1029
5/8/2026 at 5:14:55 PM
I like this - I think you're not too far off of what's popular these days, though. I think similar functionality can be achieved by using the "hook" functionality in claude code / codex.
by modo_
5/9/2026 at 1:26:42 AM
Can you explain in more detail how you implemented those tools? Is that via an MCP server?
by bostonvaulter2
5/10/2026 at 11:38:49 PM
> Is that via an MCP server?

No, this is all in one application. A WinForms+WebView2 app wraps the chat completion APIs and implements the various tools directly.
by bob1029
5/7/2026 at 9:21:49 PM
I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you'd expect.
by woeirua
5/7/2026 at 9:51:58 PM
I had to create a hypothesis-testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and the agent follows a flowchart to run a statistical test and return the answer.

At the time I had access to only 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
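The loop itself can be trivial. A minimal sketch (the `ask_model` callback and the flowchart steps are hypothetical, standing in for the real LLM call and the real statistical procedure):

```python
def run_flowchart(steps, ask_model):
    # The outer loop owns the flowchart: the model only ever sees the current
    # step plus the transcript so far, so it cannot skip ahead or wander off.
    transcript = []
    for step in steps:
        answer = ask_model(step, list(transcript))
        transcript.append((step, answer))
    return transcript

# Illustrative flowchart for a hypothesis-testing agent like the one described.
FLOWCHART = [
    "restate the question as a testable hypothesis",
    "pick the appropriate statistical test",
    "run the test on the data",
    "state the conclusion with the p-value",
]
```

Step ordering is enforced by code, so "follow the flowchart" stops being a request and becomes a guarantee.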
by rdedev
5/8/2026 at 1:39:42 AM
Wouldn't it be more efficient to convert the requirements in these 200 markdown files into Playwright tests?

You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use fewer resources.
by cheshire_cat
5/8/2026 at 1:52:54 AM
This type of thing, so much.

AI is being pushed so much at work right now. For non-dev stuff, even. The amount of things that people think are "awesome, never seen this" is staggering.
Just because you haven't seen file format X converted to file format Y before and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and have open source). But now you're paying, every single time, for a non-deterministic version of it and you find it cool. It's magic ...
But I guess they deserve it.
by tharkun__
5/8/2026 at 3:14:45 AM
> It's magic

you'll be surprised by how many people are comfortable attributing something they do not understand to Magic.
more than anything, AI let people who couldn't write simple code, and wouldn't bother to learn, sidestep the ones who can, and build solutions to scratch their own itch. that too, faster.
now human behavior kicks in, and they don't want to hand control back into the hands of people who can code to solve problems.
put this together and you have a good model for understanding the AI sales pitch... It's magic.
like all magic, it's but a trick.
by gofreddygo
5/8/2026 at 9:00:25 AM
Oh, yes! As someone who has dabbled in card tricks, this so much. People don't understand how it's done and can't imagine or conceive of a way that it possibly could be done, so they attribute it to literal magic or demons or whatever. Like, no, I just distracted you for a split second and used sleight of hand.

Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling, and all the non-tech people were all "wow! This game unloads the game world when it's not in view! Incredible!", like... yeah? That's how games have worked since the advent of 3D.
by dkersten
5/7/2026 at 11:23:14 PM
> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating.

Sorry, you thought a prompt was a suitable replacement for a testing suite?
by julianlam
5/8/2026 at 12:55:28 AM
hey man it works great barely and also costs a bunch of money every time we run it. we also can't trust the results, relax.
by zapataband1
5/8/2026 at 2:18:13 AM
If you are invested in AI stocks, this is the way. You are basically funneling money from software companies into your brokerage account. Keep going.
by deadbabe
5/8/2026 at 10:33:10 AM
> But, it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc)

couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
definitely a worse solution, non-deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms), but it could work in some cases
by throawayonthe
5/7/2026 at 9:02:30 PM
> This started breaking down after ~30 files.

Codex's short context and todo-list system, combined, somehow help here, though. Because of the frequent compaction, the model was forced to recheck which todo list items were not done yet and which workflow skill it had to use. I used to leave it for multiple hours to do a big clean-up and it finished without obvious issues.
by mmis1000
5/7/2026 at 9:08:19 PM
Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?by swores
5/7/2026 at 9:54:24 PM
I regularly get codex to do multi-hour tasks with a single prompt; I don't think that's a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to sub-agents. For example, a sub-agent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and it just manages from a bird's-eye view. I do have the $200 plan, though.
by dnh44
5/8/2026 at 6:58:07 AM
It's going to work the same regardless of how much you pay, but with Plus you'll run into the 5h usage limit rather quickly unless your "multi hour task" spends 90% of the time just waiting around for code to compile. Expect to get an hour or two of active work (single-threaded).
by dns_snek
5/8/2026 at 9:17:57 AM
If you have an org email, you can get a free ChatGPT Plus subscription.
by shivnathtathe
5/8/2026 at 1:23:46 PM
Google ADK might be useful; v2 especially reorients it around graph operators for control flow.

Your specific case is listed in the v2 docs, with an operation that fans out to many parallel tasks then joins the results.
by data-ottawa
5/8/2026 at 7:50:02 AM
I never tell claude to "go over this bunch of files and do this".

I tell it "write a program that goes over this bunch of files and does this".
Sometimes "do this" can be invoking another claude instance.
by otikik
5/7/2026 at 8:40:33 PM
I’m working on a hybrid system of old school task graph and ai agents and let them instantiate each other. I think others will do that eventually.by sroussey
5/7/2026 at 9:57:35 PM
I'm working on something similar (won't link to it as don't want people to think I'm spamming) but if you want to compare notes happy to talk.by tonylucas
5/7/2026 at 9:03:19 PM
Jira for agents?by cluckindan
5/8/2026 at 4:48:37 AM
c.f. Linear for Agentsby werrett
5/7/2026 at 10:14:47 PM
Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscriptby crsn
5/8/2026 at 12:58:02 AM
No you didn't.

> "What we're not open sourcing (yet) is the runtime."
by zapataband1
5/8/2026 at 6:43:37 PM
If it actually takes off, expect a vibecoded runtime that everyone runs on their own systems.
by tadfisher
5/7/2026 at 10:23:16 PM
The other part of the question is exactly when "build for the capabilities of future models" becomes the present.

Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.
Is it a year away, or five? That's a big difference in deciding what to build today.
by awongh
5/8/2026 at 4:14:45 AM
> We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?
by krashidov
5/7/2026 at 9:10:26 PM
You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to "run everything"; it just needs to know how to launch the right script.
by Joeri
5/8/2026 at 8:02:30 PM
For sure, this is the pattern I use.

And I wish I could make it even more deterministic. Maybe I can, but it can also be a bit challenging to sort out.
by AndyNemmity
5/8/2026 at 7:27:24 AM
I'm personally surprised by this too. Like, everyone is writing about how insanely productive AI is making them, but that productivity doesn't seem to have translated into any innovations beyond model quality.

Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?
To be fair, there is probably a way to make it work the way you want. You could add an MCP for a task queue and let the model work each item in the task queue. The tasks could be added by a deterministic system i.e. your harness.
by imtringued
5/8/2026 at 6:27:41 AM
This might be inherent to how the models are benchmarked.

Aren't some benchmarks giving the model multiple shots at a problem and only keeping the successful result if it appears, ignoring the failure rate?
by jiehong
5/8/2026 at 6:34:55 AM
Good point. We need the mean, the "any 1 of 10", and the "all 10 of 10" success rates in the metrics, so we can estimate reliability (the last one).
by andyferris
5/8/2026 at 3:39:28 PM
I'm running into similar issues; more and more I'm moving complexity from the agent into the (Go) logic in order to make it more deterministic.

To be more precise: everything is prepared in the form of files instead of letting the subagents make API/CLI calls. And still, sometimes (even with enough context) the main agent takes strange turns.
by stpedgwdgfhgdd
5/7/2026 at 9:52:22 PM
So I wonder if a more powerful agent harness could have the agent basically write and execute its own deterministic code, which, when executed, spawns sub-agents for each of the subtasks?

So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
by sharperguy
5/7/2026 at 9:55:59 PM
I've been working on an integrated deterministic/agent system for a few months now. It basically runs an AI step to build a plan, which biases towards deterministic steps as much as possible but escalates back to AI when it needs to (for AI-only capabilities or deterministic failures), so effectively (when I perfect it; I'm about 90% there) it can bounce back and forward as needed, with deterministic steps launching AI steps and AI steps launching deterministic steps.

Probably not explaining it very well, but I think it's pretty effective at reducing token usage.
by tonylucas
5/8/2026 at 9:06:36 AM
I've been building a workflow engine for agent orchestration, and the workflows are just data for the engine to execute. While I haven't experimented with it yet, I envision that an LLM would be rather good at generating the workflows based on a description of your needs (and context about how best to utilise the workflow engine).

LLMs are pretty good at reasoning about workflows; it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.
I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.
But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
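A toy version of the idea, where the workflow is plain data and the engine wires each step's declared inputs into its handler (the field names and structure here are illustrative, not the actual engine):

```python
def run_workflow(workflow, handlers):
    # The workflow is data; this engine executes it step by step.
    # Each step sees only the inputs it declares, keeping contexts focused.
    context = {}
    for step in workflow["steps"]:
        inputs = {k: context[k] for k in step.get("needs", [])}
        context[step["out"]] = handlers[step["action"]](inputs)
    return context
```

An LLM could plausibly emit the `workflow` dict from a natural-language description, while the loop itself (and thus step ordering and data flow) stays deterministic.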
by dkersten
5/7/2026 at 10:23:16 PM
[flagged]
by shripadt
5/8/2026 at 12:49:58 AM
I make codex do everything through a giant `justfile`. Simple, greppable, self-documenting, works great, and I don't even need to read it.
by peyton
5/7/2026 at 9:08:03 PM
Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.
by pishpash
5/7/2026 at 9:21:42 PM
True. The prompt reads: Run the following Python: ```
by svachalek
5/8/2026 at 8:48:53 AM
Isn't this already possible to implement with skills and subagents? Like, have a skill saying "to test these files run this script that executes a subagent for every markdown file, then check the results".
by andy12_
5/8/2026 at 6:18:42 PM
The agent can do this one by one in an agentic loop, storing the progress and backlog in files. If nothing is stored in durable memory, the context window is going to get rotten.
by rmaxdev
by rmaxdev
5/8/2026 at 5:25:12 PM
I almost always use orchestration tooling nowadays. cc itself feels too basic, even with things like superpowers.
by nvarsj
5/8/2026 at 12:50:32 PM
This was a great example, thanks
by vishna
5/8/2026 at 12:55:46 AM
[flagged]
by zapataband1
5/8/2026 at 2:18:41 AM
From the site guidelines (https://news.ycombinator.com/newsguidelines.html):> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
by BalinKing