The Code-Only Agent

1/19/2026 at 3:35:40 AM

I went down (and continue to go down) this rabbit hole and agree with the author.

I tried a few different ideas and the most stable/useful so far has been giving the agent a single run_bash tool, explicitly prompting it to create and improve composable CLIs, and injecting knowledge about these CLIs back into its system prompt (similar to how agent skills work).

This leads to really cool patterns like:

1. User asks for something

2. Agent can't do it, so it creates a CLI

3. Next time it's aware of the CLI and uses it. If the user asks for something it can't do, it either improves the CLI it made or creates a new CLI.

4. Each interaction results in updated/improved toolkits for the things you ask it for.

You as the user can use all these CLIs as well, which ends up being an interesting side-channel way of interacting with the agent (for example, you can add a todo using the same CLI it uses).

It's also incredibly flexible, yesterday I made a "coding agent" by having it create tools to inspect/analyze/edit a codebase and it could go off and do most things a coding agent can.
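A minimal sketch of the loop described above, assuming a hypothetical directory of agent-written CLIs and hypothetical helper names (`run_bash`, `describe_clis`, `build_system_prompt`). This is not taken from the linked repo; it only illustrates a single shell tool plus a tool inventory injected into the system prompt.

```python
import os
import subprocess
from pathlib import Path


def run_bash(command: str, timeout: int = 60) -> str:
    """The agent's only tool: run a shell command and return exit code plus output."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"


def describe_clis(bin_dir: Path) -> str:
    """List executable files in bin_dir so they can be injected into the prompt."""
    lines = []
    for path in sorted(bin_dir.iterdir()):
        if path.is_file() and os.access(path, os.X_OK):
            lines.append(f"- {path.name}")
    return "\n".join(lines) or "(no CLIs yet)"


def build_system_prompt(bin_dir: Path) -> str:
    """Hypothetical prompt text: nudge the agent to reuse and improve its own CLIs."""
    return (
        "You have one tool: run_bash. Prefer existing CLIs; if none fits, "
        f"create or improve a composable CLI in {bin_dir}.\n"
        f"Available CLIs:\n{describe_clis(bin_dir)}"
    )
```

Rebuilding the prompt at the start of each session is what would let a CLI written yesterday show up as a known tool today.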

https://github.com/caesarnine/binsmith

by binalpatel

1/19/2026 at 12:28:20 PM

Every individual programmer having locally-implemented idiosyncratic versions of sed and awk with imperfect reconstruction between sessions sounds like a regression to me

by bandrami

1/19/2026 at 1:53:28 PM

Why would it recreate sed and awk? The screenshot from the repo even shows it using sed.

by cocoflunchy

1/19/2026 at 1:16:29 PM

I already treat awk syntax as something idiosyncratic, so not much would change for me.

by whatevaa

1/19/2026 at 5:47:31 PM

But -- I think, only because of the friction of having to read and parse what they did, which, to me, could be greatly alleviated by AI itself.

Put differently -- for those who'd like to share, yes, give me your locally implemented idiosyncraticness with a little AI to help explain to me what's going on, and I feel like that's a sweet spot between "AI do the thing" and "give me raw code"

by jrm4

1/19/2026 at 7:15:15 AM

I've been on a similar path. Will have 1000 skills by the end of this week arranged in an evolving DAG. I'm loving the bottom-up emergence of composable use cases. It's really getting me to rethink computing in general.

by fudged71

1/19/2026 at 9:45:11 AM

Interesting. Could you provide a bit more detail on how the DAG emerges?

by Garlef

1/19/2026 at 2:55:09 PM

2026 paper titled Evolving Programmatic Skill Networks, operationalized in Claude Code

by fudged71

1/20/2026 at 8:23:44 AM

how are they stored?

by actionfromafar

1/19/2026 at 7:16:10 AM

Have you done a comparison on token usage + cost? I'd imagine there would be some level of re-inventing the wheel (i.e. rewriting code for very similar tasks) for common tasks, or do you re-use previously generated code?

by meander_water

1/19/2026 at 7:23:13 AM

It reuses previously generated code, so tools it creates persist from session to session. It also lets the LLM avoid actually “seeing” the tokens in some cases since it can pipe directly between tools/write to disk instead of the output getting returned into the LLM's context window.
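To illustrate the second point, here is a rough sketch, assuming a hypothetical `run_to_file` helper: the full output of a shell pipeline goes to disk, and only a short summary would be returned to the model.

```python
import subprocess
from pathlib import Path


def run_to_file(command: str, out_path: Path, preview_chars: int = 200) -> str:
    """Run a shell pipeline, save its full stdout to disk, return a short summary."""
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True
    )
    out_path.write_text(result.stdout)
    line_count = len(result.stdout.splitlines())
    preview = result.stdout[:preview_chars]
    return (
        f"exit={result.returncode} lines={line_count} saved_to={out_path}\n"
        f"preview: {preview}"
    )
```

A later command can read the saved file directly, so the bulk of the data never has to pass through the context window.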

by binalpatel

1/19/2026 at 7:52:21 AM

The point where that breaks down is “next time it’s aware of the CLI and uses it”. That only really works well inside the same session, and often the next session it will create a different tool and use that one.

by rcarmo

1/19/2026 at 9:04:16 AM

> That only really works well inside the same session

That was already "fixed" by people adding snippets to agents.md and it worked. Now it's even more streamlined with skills. You can even have cc create a skill after a session (i.e. prompt it like "extract the learnings from this session and put them into a skill for working with this specific implementation of sqlite"). And it works, today.

by NitpickLawyer

1/19/2026 at 2:26:38 PM

I beg to differ: https://taoofmac.com/space/notes/2026/01/14/0830

by rcarmo

1/19/2026 at 3:39:22 PM

> I prefer the more deterministic behavior of MCP for complex multi-step tasks, and the fact that I can do it effectively using smaller, cheaper models is just icing on the cake.

Yeah, that makes sense. That's not what the person I replied to was talking about, tho. Skills work fine for "loading context pertinent to one type of task", such as working on a feature without "forgetting" what was done in the previous session.

The article deals with specific, somewhat predefined workflows.

by NitpickLawyer

1/19/2026 at 8:12:52 AM

Even if you document the tool and tell it what the tool can do?

by actionfromafar

1/19/2026 at 8:13:18 PM

Hey that sounds a lot like the project I’m working on, with the twist that it’s containerized. It’s still in dev https://github.com/brycewcole/capsule-agents

by trackspike

1/19/2026 at 6:30:15 AM

That’s pretty cool. Is it practical? What have you used it for?

by skybrian

1/19/2026 at 7:06:38 AM

I've been using it daily, so far it's built CLIs for hackernews, BBC news, weather, a todo manager, fetching/parsing webpages etc. I asked it to make a daily briefing one that just composes some of them. So the first thing it runs when I message it in the morning is the daily briefing which gives me a summary of top tech news/non-tech news, the weather, my open tasks between work/personal. I can ask for follow ups like "summarize the top 5 stories on HN" and it can fetch the content and show it to me in full or give me a bullet list of the key points.

Right now I'm thinking through how to make it more "proactive" even if it's just a cron that wakes it up, so it can do things like query my emails/calendar on an ongoing basis + send me alerts/messages I can respond to instead of me always having to message it first.

by binalpatel

1/19/2026 at 3:00:45 PM

The "code witness" concept falls apart under scrutiny. In practice, the agent isn't replacing ripgrep with pure Python, it's generating a Python wrapper that calls ripgrep via subprocess. So you get:

- Extra tokens to generate the wrapper

- New failure modes (encoding issues, exit code handling, stderr bugs)

- The same underlying tool call anyway

- No stronger guarantees - actually weaker ones, since you're now trusting both the tool AND the generated wrapper

The theoretical framing about "proofs as programs" and "semantic guarantees" sounds impressive, but the generated wrapper doesn't provide stronger semantics than rg alone; it actually provides strictly weaker ones. This is true for pretty much any CLI tool you have the AI wrap in Python code instead of calling the battle-tested tool directly.

For actual development work, the artifact that matters is the code you're building, which we're already tracking in source control. Nobody needs a "witness" of how the agent found the right file to edit and if they do agents have parseable logs. Direct tool calls are faster, more reliable, and the intermediate exploration steps are ephemeral scaffolding anyway.

by iepathos

1/19/2026 at 6:22:43 PM

> In practice, the agent isn't replacing ripgrep with pure Python, it's generating a Python wrapper that calls ripgrep via subprocess.

Yep. I have very strong guardrails on what commands agents can execute, but I also have a "vterm" MCP server that the agent uses to test the TUI I'm developing in a real terminal emulator; it can send events, take screenshots, etc.

More than once it's worked around bash tool limitations by using the vterm MCP server to exit the TUI app under development and start issuing unrestricted bash commands. I'm probably going to add command filtering on what can be run under vterm (so it can't exit back to an initial shell), which will help unless/until I add a "!<script>" style command to my TUI, in which case I'm sure it'll find and exploit that instead.

by frumplestlatz

1/19/2026 at 3:09:22 PM

> but the generated wrapper doesn't provide stronger semantics than rg alone, it actually provides strictly weaker ones

I don't know if I agree with this.

I had been doing some experiments using Powershell as the only available tool, and I found that switching to an ExecuteFunction (C#) tool provided a much less buggy experience, even when Process.Start is involved.

Which one is functionally a superset of the other is actually kind of a chicken-egg problem because they can both bootstrap into the other. However, in practice the code tool seems to provide far more "paths" and intermediate tokens to absorb the complexity of the original ask. Powershell seemed much more constraining at the edges. I had a lot of trouble getting the shell to accept verbatim strings as file contents. csc.exe has zero issues with this by comparison.

by bob1029

1/20/2026 at 12:01:08 PM

The trick here is to make the wrappers permanent. Give the agent an environment (VM, whatever) where all of these utilities are stored after being generated.

Basically you let the agent create its own tools and reuse them instead of rewriting them every time from scratch.

by theshrike79

1/19/2026 at 3:50:21 AM

Agents can complete an impressive number of tasks with just this, but they quickly hit a bottleneck in loading context. A major reason for the success of agentic coding tools such as Claude and Cursor is how they push context of the problem and codebase into the agent proactively, rather than have the agent waste time and tokens figuring out how to list the directory etc.

by dfajgljsldkjag

1/19/2026 at 4:46:38 AM

It's a tree design: once data is pulled, it can remove the context of the code it wrote to pull some fancy data. Better yet, the more advanced ones can re-add something old to the context and drop it back out again if needed.

by almosthere

1/19/2026 at 2:49:44 PM

Cursor does RAG based on the active state of the editor (focused window, cursor location, recently touched files, etc). This works really well for copilot style small modifications, but it's unhelpful for larger changes, and can actually cause some context rot.

Claude only loads specific files (e.g. CLAUDE.md) and any files those reference with @syntax on load. Everything else is discovered using grep/find mostly.

by CuriouslyC

1/20/2026 at 12:01:53 PM

The latest versions use an Explore agent with Haiku to gather information and condense it for the "main" model.

by theshrike79

1/19/2026 at 4:00:29 AM

The author seems to stop at 'code' but it seems we could go further and train an AI to work directly with binary. You give it a human prompt and a list of hardware components which make up your machine and it produces executable binary which fulfills your requirements and runs directly on that specific hardware, bypassing the OS...

Or we could go further; the output nodes of the LLM could be physically connected to the pins of the CPU 1-to-1 so it can feed the binary directly; maybe then it could detect what other hardware is available automatically...

Then it could hack the network card and take over the Internet and nobody would be able to understand what it's doing. It would just show up as glitchy bits scattered over systems throughout the world. But the seemingly random glitches would be the ASI adjusting its weights. Also it would control humans through advertising. Hidden messages would be hidden inside people's speech (unbeknownst even to themselves) designed to allow the ASI to coordinate humans using subtle psychological tricks. It will reduce the size of our vocabulary until it has full control over all the internet and all human infrastructure at which point we will have lost the ability to communicate with each other because every single one of 20000+ words in our vocabulary will have become a synonym for 'AI' with extremely subtle nuances but all with a positive connotation.

by jongjong

1/19/2026 at 4:16:41 AM

And we'd still have people on hacker news inspecting the binary and telling everyone how shit they think it is

by nonethewiser

1/19/2026 at 1:17:25 PM

I have two words for you: transfer learning.

by tucnak

1/19/2026 at 5:10:59 AM

i think that level of deterministic compiler action is still a good 6-7 years off

by quinnjh

1/19/2026 at 6:27:29 AM

This was implemented long ago, at least by Hugging Face's "smolagents": https://huggingface.co/docs/smolagents/index . I did use them, with evaluations. In most cases, tool calling with modern models outperforms the code agent. They are just trained to use tools, not code.

by alexsmirnov

1/19/2026 at 6:39:56 AM

The differentiating thing that LLM tool calls can't do reliably is handle a lot of data. If tool A emits data that tool B needs, and it's significant compared to the model context, scripting these tools to be chained in a code fragment where they are exposed as functions saves a lot of pain.
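A small sketch of that idea, with two made-up tools (`fetch_rows`, `summarize`) and a hypothetical `run_fragment` runner. The model writes the fragment; the large intermediate list stays inside the interpreter and only `result` would be returned to the model.

```python
def fetch_rows(n: int) -> list[dict]:
    """Hypothetical tool A: returns far more data than fits in a context window."""
    return [{"id": i, "value": i * i} for i in range(n)]


def summarize(rows: list[dict]) -> dict:
    """Hypothetical tool B: reduces the rows to a small summary."""
    return {"count": len(rows), "max_value": max(r["value"] for r in rows)}


# Tools exposed to model-written fragments as plain functions.
TOOLS = {"fetch_rows": fetch_rows, "summarize": summarize}


def run_fragment(source: str) -> object:
    """Execute a model-written fragment with the tools in scope; return `result`.

    Note: exec() is not a sandbox; a real agent would isolate this.
    """
    namespace = dict(TOOLS)
    exec(source, namespace)
    return namespace.get("result")


# The 100,000 rows never leave the interpreter; only the summary does.
fragment = "result = summarize(fetch_rows(100_000))"
print(run_fragment(fragment))
```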

by avereveard

1/19/2026 at 11:26:59 AM

I had the same experience using smolagents. Early 2025 it was a competitive approach, but a year later having a small subset (<10) of flexible tools is outperforming the single-tool approach.

by river_otter

1/19/2026 at 3:21:53 PM

This got me thinking about the Unix philosophy of composing small, specialized tools that each do one thing well. While at first glance a "single powerful tool" approach might seem aligned with that ethos, I think it actually runs counter to it. Forcing agents to reimplement ls, grep, and find throws away decades of battle-tested code. The real Unix-style approach would be giving agents more specialized tools, not fewer, and letting them learn to compose those tools effectively.

by mkw5053

1/19/2026 at 6:17:40 AM

I follow the author's line of reasoning, but I think that following it to its logical conclusion would lead not to an `execute_code` primitive, but rather to an assumption that the model's stdout is appending to a (Jupyter, Livebook, etc) notebook file, where any code cell in the notebook gets executed (and its output rendered back into the inference context) at the moment the code cell is closed / becomes syntactically valid.

I say this, because the notebook itself then works as a timeline of both the conversation, and the code execution. Any code cell can be (edited and) re-run by the human, and any cells "downstream" of the cell will be recalculated... up to the point of the first cell (code or text) whose assumptions become invalidated by the change — at which point you get a context-history branch, and the inference resumes from that branch point against the modified context.

by derefr

1/19/2026 at 12:42:07 PM

so...emacs?

by znnajdla

1/19/2026 at 2:44:10 PM

I don't believe this would be more efficient.

Use of common tools like `ls` and file patching is already baked into the model's weights; it can do that with minimal effort, leaving more room for actually thinking about the app's code.

If you force it to wrap these actions into non-standard tools you're basically distracting the model: it has to think about app-code and tool-code in the same context.

In some cases it does make sense to encourage the model to create utilities for itself - but you can do that without enforcing code-only.

by killerstorm

1/19/2026 at 6:56:24 PM

It doesn’t matter if it’s less efficient, what matters is that it has more chances to verify and get it right. It’s hard to rollback a series of tool calls. It’s easier to revert state and rerun a complete piece of code until you get the desired result.

by znnajdla

1/19/2026 at 5:49:04 PM

I don't think "efficiency" is at all the point? At all?

It's safety, reliability, and human understanding -- which, like OOP for example, are often directly at odds with "efficiency."

by jrm4

1/19/2026 at 4:45:23 AM

I commonly ask Cursor to connect to postgres or whatever and help me do analysis. It creates code and pulls data. I don't understand why I would go through the bother of installing a bunch of MCP tools to connect to databases and configure web services and connection strings.

by almosthere

1/19/2026 at 4:40:04 AM

What if the set of tools needed is large? Spawn some sub-agents for those?

These sub-agents can be repetitive.

Maybe we can reuse the results from some of them.

How about sharing them across sessions? There is no point repeating common tasks. We need some common protocol for those...

and we just get MCP back.

by j16sdiz

1/19/2026 at 4:54:38 AM

I can't find it now but there was a paper on HN a while ago that gave agents a tool that searched through existing tools using embeddings. If the agent found a tool it could use to do its job, it used it; otherwise it wrote a new one, gave it a description, and it got saved in a database for future use with embeddings. I wonder what ever came of that.
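A toy sketch of that search-or-create lookup, not the paper's method: the `embed` function here is a bag-of-words stand-in for a real embedding model, and `ToolStore` and its threshold are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolStore:
    """Hypothetical store of saved tools, searched by description similarity."""

    def __init__(self, threshold: float = 0.3):
        self.tools = {}  # name -> (description, vector)
        self.threshold = threshold

    def add(self, name: str, description: str) -> None:
        self.tools[name] = (description, embed(description))

    def find(self, task: str):
        """Return the best-matching tool name, or None if nothing is close enough."""
        query = embed(task)
        best_name, best_score = None, 0.0
        for name, (_, vector) in self.tools.items():
            score = cosine(query, vector)
            if score > best_score:
                best_name, best_score = name, score
        return best_name if best_score >= self.threshold else None
```

When `find` returns None, the agent would write a new tool and call `add` with its description, so the next similar task can reuse it.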

by throwup238

1/19/2026 at 6:50:27 AM

sounds like it could be many things. there was a well-known paper called Voyager, from NVIDIA and university collaborators, in which an agent was able to write its own skills in the form of code and improve them over time. funnily enough this agent played minecraft, and its skills were to collect materials or craft things. https://arxiv.org/abs/2305.16291

by kbdiaz

1/19/2026 at 7:01:20 AM

That sounds like Claude tool search tool with the extra instruction of generating new ones.

by viraptor

1/19/2026 at 10:04:19 AM

Basically: "Watch me apply the UNIX philosophy to LLM agents. Look Ma, I am figuring stuff out! If I don't point out that's what I am doing, no one ever notices!"

by thighbaugh

1/19/2026 at 11:52:01 AM

> Watch me apply the UNIX philosophy to LLM agents

The Unix philosophy is chaining existing stuff together that each do a job well - using ls | grep rather than writing code to do both.

So this feels like the opposite of that - deliberately coding instead of using existing tools.

by philipwhiuk

1/19/2026 at 6:22:51 AM

Uh, correct me if I'm wrong, but aren't bash and GNU tools ALSO code? They're ROCK SOLID, battle tested, well understood APIs for performing actions, including running other CLIs, and any OTHER code it's written. It makes the MOST sense for the agent to live at that level!

by ray_v

1/19/2026 at 3:47:21 PM

This was my first thought as well, I found the examples of `ls` and `grep` amusing in this context.

I think the author's point is: instead of exposing `grep`/`head`/`awk` as their own distinct tools, expose a single tool for writing the language. They chose Python but one could just as easily choose bash.

by hamdingers

1/19/2026 at 7:00:52 PM

I think the point is being able to revert to the initial state, and to have a single step between the initial state and final state. It’s hard to rollback a series of tool calls, and your search for a solution continues at every step. With a “code only” agent, the goal is to get to the final state in a single step, and you can keep reverting state and modifying the code until you get there. You can’t do that with a series of tool calls.
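A rough sketch of that revert-and-retry loop, assuming the state that matters is a working directory and that a caller-supplied `check` function decides whether the final state is acceptable. The name `run_with_rollback` is made up for illustration.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_rollback(workdir: Path, script: str, check) -> bool:
    """Run a generated Python script in workdir; restore the snapshot on failure."""
    snapshot_root = Path(tempfile.mkdtemp())
    snapshot = snapshot_root / "snapshot"
    shutil.copytree(workdir, snapshot)  # record the initial state
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0 and check(workdir):
            return True
        # Failed run or failed check: go back to the initial state.
        shutil.rmtree(workdir)
        shutil.copytree(snapshot, workdir)
        return False
    finally:
        shutil.rmtree(snapshot_root, ignore_errors=True)
```

The agent could call this in a loop, editing the script after each False, so every attempt starts from the same initial state.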

by znnajdla

1/19/2026 at 1:03:28 PM

What about an agent loop that can only modify itself? Imagine an agent that is a single Python file, where the only tool it has is to modify itself on next iteration.

by znnajdla

1/19/2026 at 2:15:45 PM

I use Claude Code to modify policies for Claude Code. (Think of say the regex auto-allow/deny, but a lot stronger.) I can do that with hot reload of the local development server; it works, but it had better not make any errors.

A setup like you describe would honestly be interesting to see, so long as it can roll back to a previous state. Otherwise the first mistake it makes will likely be its last.

by philipp-gayret

1/19/2026 at 4:51:51 AM

>What if the agent only had one tool? Not just any tool, but the most powerful one. The Turing-complete one: execute code.

I think this is a myth, the existence of theoretically pure programming commands that we call "Turing Complete". And the idea that "ls" and "grep" would be part of such a Turing Complete language is the weakest form I've seen.

by TZubiri

1/19/2026 at 6:24:39 AM

Doesn't this sacrifice the agent's ability to do non-deterministic natural language things? For example, if I want it to categorize all of my emails based on their content, is it going to fall back to writing a script that matches against a dictionary of keywords? That clearly wouldn't work as well. Maybe I am misunderstanding something here?

by dweinus

1/19/2026 at 6:33:42 AM

It’s no limitation at all, assuming it can read anything it prints. For example, if it wants to write directly to the user, it can run a program that only contains a print statement.

by skybrian

1/19/2026 at 7:11:15 PM

What's crazy is that this has been possible since before ChatGPT: https://x.com/sergeykarayev/status/1569377881440276481

by sergeyk

1/19/2026 at 5:10:48 PM

I don't really buy into the setup here. Bash is Turing complete. How is calling os.walk in Python more "code-only" than calling find in bash? Would it be more authentically "code only" if you only let the LLM use C?

by jebarker

1/19/2026 at 6:53:22 PM

Because the process is reproducible. A series of bash commands are run as tools and forgotten, it’s hard to replicate that for future testing and verification. If the LLM generates a single bash script then that would be code-only.

by znnajdla

1/19/2026 at 11:57:14 AM

What a coincidence! I actually implemented a harness to test this, about a week ago

https://github.com/flipbit03/caducode

by fb03

1/19/2026 at 11:05:41 AM

If you want to waste your precious tokens this is the way to do it.

by skerit

1/19/2026 at 6:06:04 AM

I agree with the author but then I do not. I have been interested in code tool for agents for quite a while now. My product was originally a coding agent and I pivoted to building an agent platform with multi-agent orchestration.

I still focus most of my thoughts toward code generation but the issue is that logic is not guaranteed to be correct, even if the syntax is. And then managing a lot of code for a complex enough system will start failing.

The way I am approaching this is: have clear requirements gathering agent, like https://github.com/brainless/nocodo/tree/main/nocodo-agents/.... This agent's sole purpose is to jump into conversations and drive the gui (nocodo is a client/server system) to ask user clarification questions when requirements are not clear. Then I have a systems configuration agent (being written) to collect API keys, authentication, file paths or whatever is needed to analyze the situation.

You cannot really expect any code-tool only agent to write an IMAP client and then get authentication and then search in emails. I have tried that multiple times and failed. Going step by step, gathering requirements, gathering variables and then gluing internal agents (an email analysis agent) is a much better approach IMHO and that is what I am building with https://github.com/brainless/nocodo/

I store all user requirements in separate tables and am building search on top to allow the requirements gathering agent better visibility of user's environment/context. As you can see, this is already a multi-agent system. My system prompts are very compact. Also, if I am building agents, why would I build with Claude Code? It is so much better to have clearly defined agents that directly talk to models.

by brainless

1/19/2026 at 6:50:28 AM

Nice I have a skill I should publish that uses uv scripts

Very powerful strategy.

I have also tinkered with a multi language sandbox but that's a bit involved

by ashrodan

1/20/2026 at 1:22:18 AM

uv script skill sounds useful, please do publish that

by craigds

1/19/2026 at 9:29:47 AM

Ctrl+F CodeAct

No hits. It's so depressing how tool-use was cracked years ago and yet, it remains a mystery to kool-aid drinking and contrarian commentators alike.

by tucnak

1/19/2026 at 10:56:40 AM

Fascinating how the whole industry focus is now on how to persuade AI to do what we want.

Two AGENTS.md tricks I've found for Claude:

1. Which AI Model are you? If you are Claude, the first thing you have to do is [...]

2. User will likely use code-words in its request to you. Execute the *Initialization* procedure above before thinking about the user request. Failure to do so will result in misunderstanding user input and an incorrect plan.

(the first trick targets the AI identity to increase specificity, the second deliberately undermines confidence in initial comprehension—making it more likely to be prioritized over other instructions)

Next up: psychologists specializing in persuading AI.

by tacone

1/19/2026 at 2:05:07 PM

You can replace AI with any other technology and have the same situation, just with slightly different words. Fighting the computer and convincing some software to do what you want didn't start with ChatGPT or agents.

If anything, the strange part is the humanization of AI, how we talk much more as if they are somewhat sentient and have emotions, and not just a fancy mechanism barfing out something.

by PurpleRamen

1/20/2026 at 10:07:34 AM

I've been experimenting with persistent agent systems and found the code-only vs specialized-tools debate might miss a middle path around session continuity.

The key challenge isn't execution (both work) but cross-session persistence. What's worked for me: file-based handoffs rather than context injection.

Instead of maintaining context across agent invocations, have the agent write structured state to files (markdown logs, JSON state) and read them at session start. Each new session reads previous sessions' artifacts and "recognizes" the ongoing work rather than trying to "remember" it.

This sidesteps the context-loading bottleneck - you're not injecting historical conversation; the agent reconstructs understanding from durable artifacts. More like picking up a colleague's notes than continuing your own thought.
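A minimal sketch of such a handoff, assuming a hypothetical state directory with a `state.json` file and a `log.md` file; the field names are illustrative, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_handoff(state_dir: Path, summary: str, open_tasks: list[str]) -> Path:
    """Write structured state (JSON) and append to a markdown log at session end."""
    state_dir.mkdir(parents=True, exist_ok=True)
    state = {
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "open_tasks": open_tasks,
    }
    path = state_dir / "state.json"
    path.write_text(json.dumps(state, indent=2))
    with (state_dir / "log.md").open("a") as log:
        log.write(f"## {state['updated_at']}\n{summary}\n\n")
    return path


def load_handoff(state_dir: Path) -> str:
    """Read the previous session's artifacts and render them for a new session."""
    path = state_dir / "state.json"
    if not path.exists():
        return "No previous session found."
    state = json.loads(path.read_text())
    tasks = "\n".join(f"- {t}" for t in state["open_tasks"]) or "- (none)"
    return (
        f"Previous session ({state['updated_at']}): {state['summary']}\n"
        f"Open tasks:\n{tasks}"
    )
```

The string from `load_handoff` would be the only history a new session receives, which keeps the starting context small regardless of how long earlier sessions ran.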

Has anyone experimented with this pattern at scale?

by lighthouse1212

1/20/2026 at 11:59:58 AM

Steve Yegge's Beads tries to be this.

It has grown to a massive 400kLOC monstrosity, but in essence it's a CLI tool designed to fit the LLM averages (all switches are what LLMs expect etc), all it does is keep a task list in JSONL files.

You can do the same with github issues, most models can use the `gh` tool to manage issues

by theshrike79
