3/27/2026 at 8:04:25 AM
I’m curious whether anyone has measured this systematically. Right now, most of the evidence for multi-agent setups still feels anecdotal.
by yesensm
3/29/2026 at 2:04:56 PM
I ran a Claude Code pipeline across 98 Rails models, 9,000 tests total. The thing that made it repeatable: each agent can only do one job. The analyzer writes a YAML plan but no Ruby. The writer gets one slice of that plan, not the whole thing. A Ruby script (not the AI) checks the output for 138 known mistakes before anything moves forward. If the check fails, the agent has to fix it.

That Ruby script is where the data comes from. I can see exactly which checks caught which problems across all 98 runs. That is not anecdotal, it is a log file.
by viktorianer
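A minimal sketch of the kind of deterministic gate described above: a Ruby script that runs named checks against an agent's output and reports which ones fail. The check names and predicates here are hypothetical illustrations, not the commenter's actual 138 checks:

```ruby
# Hypothetical validation gate for agent-generated Ruby source.
# Each check is a name plus a predicate over the generated text.
CHECKS = {
  "no_raw_sql"         => ->(src) { !src.match?(/execute\s*\(\s*["']/) },
  "uses_frozen_string" => ->(src) { src.start_with?("# frozen_string_literal: true") },
  "no_focused_specs"   => ->(src) { !src.include?("fit(") && !src.include?("fdescribe(") }
}

# Run every check and return the names of the ones that failed.
# Logging these failures per run is what makes the results auditable
# rather than anecdotal.
def failed_checks(source)
  CHECKS.reject { |_name, check| check.call(source) }.keys
end

failed = failed_checks(%(User.connection.execute("DELETE FROM users")))
# Non-empty => the pipeline blocks until the agent fixes every failure.
```

Because the gate is plain Ruby rather than another model, its verdicts are reproducible and its failure log doubles as the measurement data.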
3/27/2026 at 11:27:40 AM
And expensive, exactly the way a pay-per-use product would push its customers… “It’s not working well enough!” we tell them. They respond with “Have you tried using it more?”
by not_ai
3/27/2026 at 1:31:50 PM
Back in 2024 I read a study saying: "Ask 4 LLMs the same question; if they all give you the same answer, there is some 95-99% chance it's correct."

Soooo... it's not just greed. There is something there.
by 3yr-i-frew-up
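The agreement heuristic in that study can be sketched as a simple consensus check: collect one answer per model and accept only a unanimous result. This is an illustrative sketch, not the study's actual protocol:

```ruby
# Hypothetical consensus check across several model answers.
# Accept the answer only when all models agree after normalization;
# any disagreement returns nil, signalling the answer needs review.
def consensus(answers)
  normalized = answers.map { |a| a.strip.downcase }
  normalized.uniq.size == 1 ? normalized.first : nil
end

consensus(["Paris", "paris", " Paris "])   # unanimous: accepted
consensus(["Paris", "Lyon", "Paris"])      # disagreement: rejected
```

Unanimity is the strong signal; the interesting design question is how loose the normalization can be before independent models only appear to agree.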
3/27/2026 at 2:58:02 PM
Yes, exactly. I talk about this in the article: I found that when Claude and Codex both review the same PR and both flag the same issue, our team fixes it 100% of the time.
by axldelafosse
3/27/2026 at 4:36:57 PM
What's the point of pair programming then, if they both have the same opinions?
by zombot
3/27/2026 at 5:38:44 PM
They don't. And you would be surprised how a good model actually pushes back on some comments. The point was: when they do agree, it is a very strong signal.
by axldelafosse
3/27/2026 at 4:54:54 PM
There are a number of different models out there.
by pixl97
3/27/2026 at 1:22:18 PM
Haha yeah... Wait until they start jacking up the subscription prices.
by shafyy
3/27/2026 at 3:20:55 PM
They don't change the prices; they just modify the amount of compute allocated - slower speeds and fewer tokens. They can set everything in the background to optimize costs and returns, and the user never realizes anything has changed. Sometimes they'll announce the changes, and they'll even try to spin them as improving services or increasing value.
Local AI capabilities are improving at a rapid pace, at some point soon we'll have an RWKV or a 4B LLM that performs at a GPT-5 level, with reasoning and all the bells and whistles, and hopefully that'll shake out most of the deceptive and shady tactics the big platforms are using.
by observationist
3/28/2026 at 2:43:59 PM
> They don't change the prices, they just modify the amount of compute allocated - slower speeds and fewer tokens, they can set everything in the background to optimize costs and returns, and the user never realizes anything has changed.

I can't imagine that this is the way it will go... Tokens haven't been getting cheaper for flagship models, have they? You already see something closer to their real cost if you compare, e.g., the Claude subscriptions to their actual token pricing.
> Local AI capabilities are improving at a rapid pace, at some point soon we'll have an RWKV or a 4B LLM that performs at a GPT-5 level, with reasoning and all the bells and whistles, and hopefully that'll shake out most of the deceptive and shady tactics the big platforms are using.
Maybe, but LLMs are a scale game, and a data center will always be more capable than your local device. So you will always be getting a worse version locally. Or do you think LLMs in data centers will stop getting better and local LLMs will somehow catch up?
by shafyy
3/27/2026 at 9:13:43 AM
Completely with you on this! But then we need to define the criteria for comparison. Might not be that easy, unfortunately.
by stackgrid