alt.hn

3/16/2026 at 1:24:09 AM

How I write software with LLMs

https://www.stavros.io/posts/how-i-write-software-with-llms/

by indigodaddy

3/16/2026 at 6:18:17 AM

Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

by akhrail1996

3/16/2026 at 7:56:51 AM

This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.

We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.

Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
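The shape of the comparison can be sketched as follows. This is a toy illustration, not our actual harness: `call_model` is a stand-in stub for a real LLM API call (no network, it just echoes), and the personas and prompts are simplified.

```python
# Toy sketch of the two setups; only the orchestration shape is real.
PERSONAS = ["architect", "business analyst", "security expert", "developer", "infra"]
calls = {"n": 0}  # count model invocations to compare overhead

def call_model(prompt: str) -> str:
    """Placeholder for an actual model API call."""
    calls["n"] += 1
    return f"response to: {prompt[:40]}"

def council(requirement: str) -> str:
    # Setup A: each persona analyzes the request, then a final call
    # distills their discussion into one solution.
    opinions = [call_model(f"As the {p}, analyze: {requirement}") for p in PERSONAS]
    return call_model("Distill a solution from:\n" + "\n".join(opinions))

def single_prompt(requirement: str) -> str:
    # Setup B: one call that carries all the persona definitions at once.
    return call_model(f"Considering the views of {', '.join(PERSONAS)}, solve: {requirement}")
```

Even in this stub, setup A costs one call per persona plus a distillation call, while setup B is a single call carrying the same role definitions.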

The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.

To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming about $0.30 and mostly using Haiku.

This surely deserves more investigation, but our working hypothesis so far is that coordination and communication between agents carries a remarkable cost.

Should this be the case, I personally would not be surprised:

- The reason we humans separate jobs is that we have inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.

- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point where processes turn into bureaucracy. In IT companies, many problems arise at the interface between groups, because of the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.

by arialdomartini

3/16/2026 at 6:35:43 PM

LLMs also don't have the primary advantage humans get from job separation: diverse perspectives. A council of Opuses is all exploring the exact same weights on the exact same hardware, unlike multiple humans with unique brains and memories. Even with different models, Codex 5.3 is far more similar to Opus than any two humans are to each other. Telling an Opus agent to focus on security puts it in a different part of the weights, but it's the same graph: it's not really more of an expert than a general Opus agent with a rule to maintain secure practices.

by Miraste

3/16/2026 at 7:08:34 PM

You can differentiate by context: one sees the work session, the other sees just the code. Same model, but different perspectives. Or by model: there are at least 7 decent models across the top 3 providers.
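The context split can be sketched like this. `call_model` is a placeholder for a real model call, so only the difference in what each reviewer sees is illustrated:

```python
# Two perspectives from the same (stubbed) model: one reviewer sees the
# work session plus the code, the other sees only the code.
def call_model(prompt: str) -> str:
    """Placeholder for an actual model API call."""
    return f"review based on {len(prompt)} chars of context"

def review_with_session(session_log: str, code: str) -> str:
    # Perspective 1: knows the intent and the back-and-forth that led here.
    return call_model(f"Session:\n{session_log}\n\nCode:\n{code}\n\nReview this change.")

def review_code_only(code: str) -> str:
    # Perspective 2: a cold read, like an outside reviewer.
    return call_model(f"Code:\n{code}\n\nReview this change.")
```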

by visarga

3/16/2026 at 6:02:44 PM

Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.

by mikkupikku

3/16/2026 at 8:38:20 AM

This matches what I've seen too. I spent time building multi-step agent pipelines early on and ended up ripping most of it out. A single well-prompted call with good context does 90% of the work. The coordination overhead between agents isn't just a cost problem; it's a debugging nightmare when something goes wrong and you're tracing through 5 agent handoffs.

by nvardakas

3/16/2026 at 9:47:46 AM

If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?

by titanomachy

3/16/2026 at 10:20:15 AM

Fair point. I could try with a harder problem. This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn $12 on such an easy task. I mean, it's 40 times the cost.

by arialdomartini

3/16/2026 at 10:40:34 AM

I'm a bit confused, actually: you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?

by titanomachy

3/16/2026 at 7:06:33 PM

An ensemble can spot more bugs and fixes than a single model. I run Claude, Codex and Gemini in parallel for reviews.
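A minimal sketch of that kind of parallel ensemble review. The three reviewer functions are hypothetical stand-ins for calls to different models, each returning canned findings; the ensemble keeps the union of everything any reviewer spotted:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for three different model backends (canned data, not real APIs).
def claude_review(diff):
    return {"off-by-one in loop", "missing null check"}

def codex_review(diff):
    return {"missing null check", "unvalidated input"}

def gemini_review(diff):
    return {"off-by-one in loop"}

def ensemble_review(diff: str) -> set:
    reviewers = [claude_review, codex_review, gemini_review]
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        results = pool.map(lambda review: review(diff), reviewers)
    # Union: a finding survives if ANY model caught it.
    findings = set()
    for found in results:
        findings |= found
    return findings
```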

by visarga

3/16/2026 at 3:44:13 PM

To me, such techniques feel like temporary cudgels that may or may not even help, and that will be obsolete in 1-6 months.

This is similar to telling Claude Code to write its steps into a separate markdown file, or use separate agents to independently perform many tasks, or some of the other things that were commonly posted about 3-6+ months ago. Now Claude Code does that on its own if necessary, so it's probably a net negative to instruct it separately.

Some prompting techniques seem ageless (e.g. giving it a way to validate its output), but a lot of these feel like temporary scaffolding that I don't see a lot of value in building a workflow around.

by moduspol

3/16/2026 at 6:08:01 PM

Totally agree - the fundamental concept here of automatically improving context control when writing code is absolutely something that will be baked into agents in 6 months. The reason it hasn't been yet is mainly that the improvements it makes seem to be very marginal.

You can contrast this to something like reasoning, which offered very large, very clear improvements in fundamental performance, and as a result was tackled very aggressively by all the labs. Or (like you mentioned) todo lists, which gave relatively small gains but were implemented relatively quickly. Automatic context control is just going to take more time to get it right, and the gains will be quite small.

by TheMuenster

3/16/2026 at 7:12:28 PM

Workflow matters too: how you organize your docs, work tasks, reviews. If you do it all by hand, you spend a lot of time manually enforcing a process that could be automated.

I think task files with checkable gates are a very interesting animal: they carry intent, plan, work and reviews, and at the end of the work they can become docs. They can be executed, but also passed as values, and they can reflect on themselves - so they sport homoiconicity and reflection.
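One possible shape for such a file, sketched in Python. The file contents, gate names, and parsing convention are all invented for illustration: a markdown checklist where work may only proceed once every gate is ticked.

```python
import re

# A hypothetical task file: markdown checklist items act as gates.
TASK_FILE = """\
# Task: add rate limiting
- [x] Gate: plan reviewed
- [x] Gate: tests written
- [ ] Gate: code reviewed
"""

def gates(text: str):
    """Return (gate name, done?) pairs parsed from the checklist."""
    return [(m.group(2), m.group(1) == "x")
            for m in re.finditer(r"- \[( |x)\] (.+)", text)]

def ready_to_merge(text: str) -> bool:
    # The gate file is both the plan and the executable check on it.
    return all(done for _, done in gates(text))
```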

by visarga

3/16/2026 at 7:33:42 AM

There's a lot of cargo culting, but it's inevitable in a situation like this, where the truth is model-dependent and changing the whole time, and people have built companies on the premise that they can teach you how to use AI well.

by kybernetikos

3/16/2026 at 2:30:51 PM

It's also inevitable given that we still don't really know how these models work or what they do at inference time.

We know input/output pairs, when using a reasoning model we can see a separate stream of text that is supposedly insight into what the model is "thinking" during inference, and when using multiple agents we see what text they send to each other. That's it.

by _heimdall

3/16/2026 at 9:39:37 AM

I think this is just anthropomorphism. Sub-agents make sense as a context-saving mechanism.

Aider did an "architect-editor" split where the architect is just a "programmer" who doesn't bother formatting the changes as diffs; then a weak model converts them into diffs, and they got better results with it. This is nothing like human teams, though.
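The split can be sketched like this. Both model functions are canned stubs, not real APIs, and the file names and diff content are invented; the point is only the division of labor:

```python
# Sketch of the architect/editor idea: a strong model describes the change
# in plain prose, and a weak model only has to turn that prose into an edit.
def strong_model(prompt: str) -> str:
    # Reasons about WHAT to change, ignoring edit formatting entirely.
    return "In utils.py, rename helper() to parse_helper() and update its callers."

def weak_model(prompt: str) -> str:
    # Purely mechanical work: turn a description into diff format.
    return "--- utils.py\n-def helper():\n+def parse_helper():"

def architect_editor(request: str) -> str:
    plan = strong_model(f"Describe in plain prose how to: {request}")
    return weak_model(f"Convert this description into a unified diff:\n{plan}")
```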

by never_inline

3/16/2026 at 5:55:43 PM

Absolutely agree with this. The main reason this improves performance is simply that the context is being better controlled, not that this approach is actually going to yield better results fundamentally.

Some people have turned context control into hallucinated anthropomorphic frameworks (Gas Town being perhaps the best example). If that's how they prefer to mentally model context control, that's fine. But it's not the anthropomorphism that's helping here.

by TheMuenster

3/16/2026 at 6:59:29 AM

> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based in their experiences.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

by jaredklewis

3/16/2026 at 7:16:24 AM

My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.

by codemog

3/16/2026 at 1:50:42 PM

I’ve read dozens of them and find them unconvincing for the reasons outlined. If you want a more specific critique, link a paper.

I personally like and use tests, formal verification, and so on. But the evidence for these methods is weak.

edit: To be clear, I am not ragging on the researchers. I think it's just kind of an inherently messy field with pretty much endless variables to control for and not a lot of good quantifiable metrics to rely on.

by jaredklewis

3/16/2026 at 8:14:06 AM

[dead]

by ChrisGreenHeur

3/16/2026 at 7:16:55 AM

You can measure customer facing defects.

Also, lines of code is not a completely meaningless metric. What one should measure is lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.

Given all that, you can measure customer-facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflows.

by thesz

3/16/2026 at 7:56:33 AM

> Also, lines of code is not completely meaningless metric.

Comparing lines of code can be meaningful, but mostly if you can keep a lot of other things constant: coding style, developer experience, domain, tech stack. There are many style differences between LLM- and human-generated code, so I expect 1000 lines of LLM code to do a lot less than 1000 lines of human code, even in the exact same codebase.

by codeflo

3/16/2026 at 7:27:17 AM

The proper metric is the defect escape rate.
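For reference, escape rate is usually computed as the share of defects found after release out of all defects found; a minimal version, with the variable names my own:

```python
def defect_escape_rate(found_internally: int, found_in_production: int) -> float:
    """Fraction of all known defects that 'escaped' past internal checks."""
    total = found_internally + found_in_production
    return found_in_production / total if total else 0.0
```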

by jacquesm

3/16/2026 at 7:35:03 AM

Now you have to count defects.

by exidex

3/16/2026 at 7:36:40 AM

You have to do that anyway, and in fact you probably were already doing that. If you do not track this then you are leaving a lot on the table.

by jacquesm

3/16/2026 at 10:49:10 AM

I was thinking more in terms of creating a benchmark which would be optimized during training. For regular projects, I agree: you have to count that anyway.

by exidex

3/16/2026 at 7:49:41 AM

Most developer intuitions are wrong.

See: OOP

by slopinthebag

3/16/2026 at 8:50:07 AM

Intuition is subjective. It's hard to convert subjective experience to objective facts.

by vbezhenar

3/16/2026 at 9:23:35 AM

That's what science is, though:

- our intuition/hunch/guess is X

- now let's design an experiment which can falsify X

by tomgp

3/16/2026 at 8:58:21 AM

The different models is a big one. In my workflow, I've got opus doing the deep thinking, and kimi doing the implementation. It helps manage costs.

Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker cannot edit the plan. The QA or planner can't modify the code. This is something I sometimes catch Codex doing: modifying unrelated stuff while working.
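A sketch of what harness-enforced role permissions could look like, so the prompt isn't the only thing stopping an agent from touching files outside its role. The role names and path rules are illustrative, not any particular tool's config:

```python
# Each role gets a whitelist of path prefixes it may edit, enforced by the
# harness rather than by prompt instructions.
PERMISSIONS = {
    "planner": ("plan.md",),   # may edit the plan, nothing else
    "worker": ("src/",),       # may edit code, but not the plan
    "qa": (),                  # read-only: reviews, never edits
}

def may_edit(role: str, path: str) -> bool:
    return any(path == allowed or path.startswith(allowed)
               for allowed in PERMISSIONS[role])
```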

by lbreakjai

3/16/2026 at 9:28:09 AM

I recently had a horrible misalignment issue with a 1 agent loop. I've never done RL research, but this kind of shit was the exact kind of thing I heard about in RL papers - shimming out what should be network tests by echoing "completed" with the 'verification' being grepping for "completed", and then actually going and marking that off as "done" in the plan doc...

Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.

I think the issue is that these 1-agent pipelines use extremely aggressive language like "YOU MUST PLAN, IMPLEMENT AND VERIFY EVERYTHING YOURSELF!". I think that kind of language coerces the agent into actively malicious hacks, especially if the pipeline itself doesn't treat "I am blocked, shifting tasks" as a valid outcome.

1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.

by sigbottle

3/16/2026 at 7:16:40 AM

After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

by jumploops

3/16/2026 at 7:17:31 AM

I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need to have the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.

by totomz

3/16/2026 at 7:59:42 AM

Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.

by ako

3/16/2026 at 7:19:34 AM

> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

There's a 63-page paper with a mathematical proof if you're really into this.

https://arxiv.org/html/2601.03220v1

My takeaway: AI learns from real-world texts, and real-world corpora tend to reflect a role split of architect/developer/reviewer.

by est

3/16/2026 at 7:46:43 AM

>> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

> There's a 63-page paper with a mathematical proof if you're really into this.

> https://arxiv.org/html/2601.03220v1

I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.

by codeflo

3/16/2026 at 9:30:25 AM

> proves nothing remotely like the question that was asked

I am not an expert, but by my understanding, the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation - aka, you can't always one-shot perfect code.

However, arranging many pipelines of role "observers" may gradually get you there.

by est

3/16/2026 at 7:50:21 AM

Can you explain how this paper is relevant to the comment you replied to?

by anhner

3/16/2026 at 10:52:30 AM

It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.

Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.

by stavros

3/16/2026 at 10:45:46 AM

Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.

I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.

Is it useful to tell something "you are an architect"? I doubt it, but I don't have proof apart from getting reasonable results without it.

With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.

by zingar

3/16/2026 at 8:38:21 AM

Yeah, always seemed pretty sus to me too.

At the same time, I can see a more linear approach doing something similar. Like when I ask for an implementation plan: that is functionally not all that different from an architect agent, even if not wrapped in such a persona.

by Havoc