5/9/2026 at 2:26:59 PM
I'm suspicious of their results with regard to tool usage.

It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding risky round-trip edits of the whole file.
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and whether it encouraged Python-based manipulation over reading and then writing the file.
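For illustration, here's a minimal sketch (mine, not the paper's harness; file name and snippets are hypothetical) of the kind of surgical edit a model could perform through run_python() instead of rewriting the whole file:

    # Sketch only: a surgical edit via a run_python()-style tool.
    # The file body stays on disk; only these commands pass through the model.
    path = "report.txt"          # hypothetical target file
    old = "accuracy of 0.82"     # unique snippet to change
    new = "accuracy of 0.91"
    text = open(path).read()
    assert text.count(old) == 1, "snippet must be unique to edit safely"
    open(path, "w").write(text.replace(old, new, 1))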
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
    You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results of the paper reflect more on the design of the harness that the paper's authors used than on the models themselves.

I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.
by simonw
5/9/2026 at 7:26:21 PM
I agree with most of what you wrote except for this:

> Frequent LLM users already know not to do that.
And I think that's the biggest problem. Amidst the current push to utilize LLMs across orgs and groups, there is a large number (maybe even a majority) of people who are using them every day but who have never approached anything as technical as a "harness" before, let alone an entire setup.
For them the behavior mentioned here is a major issue.
by ofjcihen
5/10/2026 at 3:10:11 AM
Exactly. When I use scissors, I don't want them to stop working just because I'm not a "frequent scissors user," and then get told by someone who makes their breakfast with scissors that I'm doing it wrong. Most people will not be "frequent" anything users.
by AlienRobot
5/10/2026 at 8:42:31 AM
Most people also understand that, because they're not "frequent" users of a thing, they absolutely suck at using it, and set their expectations accordingly. In particular, they realize that doing anything non-trivial with the thing requires them to spend some learning and practice time, or to ask/hire a "frequent" user to do it for them.

So the reasonable response to being told you're holding your scissors wrong is to realize that yes, you most likely are holding your scissors wrong[0], and ask the other person for advice (or just to do the thing), or look up a YouTube video and learn, or sign up to a class, or such.
Expecting mastery in 30 seconds is not a reasonable attitude, but it's unfortunately the lie that the software industry has tried to sell people for the past 15 years or so.
--
[0] - There's much more to it than one would think.
by TeMPOraL
5/10/2026 at 9:39:01 AM
I'm interested in the "non-trivial" point as well. This seems to be a common refrain from the anti-LLM tech crowd: "LLMs aren't good at doing anything non-trivial." Well, is that really the case, or is it just harder, so that one needs to put in more practice for more complicated tasks?

I don't have an example off hand, but I know that it's easy to dismiss something an LLM does as trivial if your own work is extremely marginal. Most devs aren't creating their own programming languages. I can't help but think people who hold this opinion also think the work most software professionals do is "trivial" ("you're just moving strings around, that's not impressive / that's trivial").
by wfurney
5/10/2026 at 12:42:24 PM
On the one hand there's simonw's concept of a "frequent LLM user", and then there's the actual vast majority of people using the ChatGPT web app or one of the various Office Copilots.
by fluidcruft
5/11/2026 at 3:12:30 PM
I should have said "frequent, expert LLM user".
by simonw
5/10/2026 at 9:20:51 AM
If you make the example any more complicated, it makes sense.

A lathe operator isn't any good if they don't frequently operate lathes.
An articulated robot implementer needs frequent experience implementing robots to be any good.
That doesn’t mean lathes or robots are useless. Nor does it mean they have failed as products because they require expertise.
I do think it raises questions as to whether vast swathes of the population will be effective at using LLMs. Are they scissors, or a lathe?
by mediaman
5/10/2026 at 9:34:05 AM
Everybody seems to want them to be scissors, or at least to treat them as such, but even still, the reason everyone can use scissors so well is because they've practiced with them, right? You're probably a lot better at using scissors now than the first time you did it; the functionality is just so simple it's harder to notice.

To me, learning to use LLMs is the same as doing anything else: you have to practice and put in the hours to get good. Maybe some harnesses will eventually allow LLMs to function more as scissors than lathes. This seems to be what Microsoft is trying to do by embedding Copilot in all their products and saying "choose the UI that works best for you". If that doesn't end up working, we'll need another paradigm for "non-technical" users to effectively operate computer assistants.
by wfurney
5/10/2026 at 8:17:15 PM
I figure English is the next coolest programming language for scripting and compilation. So far people have been writing fun little demos with it, but now people are starting to place real demands on it, and you're starting to see actual programs needing to be built. Unsurprisingly this requires a bit more craft.
by Kim_Bruning
5/10/2026 at 9:51:51 PM
> I figure English is the next coolest programming language for scripting and compilation. So far people have been writing fun little demos with it, but now people are starting to place real demands on it, and you're starting to see actual programs needing to be built. Unsurprisingly this requires a bit more craft.

Perhaps that craft of using the exact subset of English has something to do with the correct selection of words and concise, yet expressive enough, expressions, in a fashion resembling creating a code.
A code that's meant to be understood by machines, we could call it "computer code". And said computer code could be used to create recipes, algorithms, let's call them "programs". Hey, I think I have ideas for 2 possible names for this process!
by oblio
5/11/2026 at 2:56:01 AM
No wait, you think I'm being silly, so that's why you're being a bit sarcastic back.

But seriously, you can put a shebang on an English text file now (if you're sufficiently brave), or feed it through something that spits out code on the other end (so you can proofread the consequences before executing them).
It's crazy, but this is 2026, and that actually ... just works. You can even do it locally, if you don't mind running a space heater.
Thing is, when you have the expressiveness and power of a full natural language (and you're already paying for it), why would you want to constrain yourself to a subset? That's not very practical. Why not use all of it? Computing was never about typing code into machines anyway. "Computer" used to be a human profession, until it got automated.
On the upside, there's thousands of years of documentation. On the downside, a lot of said documentation is underspecified and/or straight wishful thinking. It's certainly an interesting avenue to explore.
by Kim_Bruning
5/11/2026 at 7:29:50 AM
> why would you want to constrain yourself to a subset? That's not very practical. Why not use all of it?

For the same reason math, physics, chemistry, etc. figured out a long time ago that Koine Greek, Latin, French, German, English, etc. aren't the best languages for science. Constraint gives focus, precision.
If you code novels, knock yourself out.
by oblio
5/11/2026 at 12:30:42 PM
Let's actually look at this as if I'm serious for a second? Tell me this framing really can't work. Non-exhaustively:
Python has if, for, while, def, class, and first-class lists, dicts, functions;
Forth has this stack machine concept, RPN, compilation-in-the-REPL when defining new functions.
Lisp has this code-is-data-is-code concept, and CAR, CDR, first-class lists (obviously), first-class functions (in some of them)... etc.
Machine code can (theoretically) be directly expressed in logic gates.
How about a quick look at what English supports:
Conditionals, iteration, abstraction, composition, delegation, exception handling, scope, naming, modularity; intent, priority, graduated precision, analogy, context-dependence. And the concept of semantic triples is built in as a syntactic primitive (subject-predicate-object), so you can even do a bunch of GOFAI right off the bat.
It's weird thinking of English as a programming language. But it kind of works like one if you want it to, and computers can process it now?
by Kim_Bruning
5/11/2026 at 1:41:25 PM
I'm not saying English (or any other natural language) is not usable. It is, since it's a more complex language than a programming language; all natural languages are supersets of current programming languages.

I'm talking about the opposite problem: these supersets are ambiguous, contradictory, vague. At the end of the day the thing that is programmed needs to be clear, unambiguous, and ideally concise, too (performance in its million incarnations).
So yeah, I guess you can fix the ambiguous aspect with verbosity: just write more words until you define everything you would define more directly in a formal language.
I would be extremely shocked if programming didn't require knowing a very specific, albeit huge, domain jargon.
by oblio
5/11/2026 at 2:23:43 PM
The question isn't per se "is English a great language to write the next sorting algorithm in?" Probably not. Rust is quicker, and cheaper to execute besides. But there are entire classes of problem that English might be more useful for.

English assumes the target is an agent with memory/state in a given context. Ambiguity, verbosity, and noise are strongly reduced by means of modelling the other agent's state, then only transmitting the required state diff. The receiver decodes by comparing the diff against the other side's predicted state and updating. [1] This kind of protocol would obviously be NUTS to build from scratch if you went about it as an engineer, I'd think. But we have the hardware and software preinstalled in humans, and now my 3090 can run an (imperfect, but viable) decoder.
Is it useful? Yeah, I think it actually is. English is able to encode things that are ambiguous, contradictory, vague... and get useful results. Not always; maybe not even often. As you say, skill required, but the option is there. Formal languages just crash.
It's interesting is what I'd call it.
[1] see also: Clark & Brennan's grounding theory in linguistics; Predictive coding in neuroscience; Delta encoding in compression; and Theory of mind in cognitive science. They all dance around the same shape, so this is roughly accurate I think.
by Kim_Bruning
5/11/2026 at 11:50:16 AM
In a semi-random sample of 10 recent articles on arxiv.org, 10 articles (100%) contained English as the predominant part of the corpus. Where necessary, mathematical notation was included.

So - you're not wrong that e.g. mathematical notation is (often) used, as we both very well know. But English is really quite prominent!
And now computers can process both, where before they couldn't.
The engineering doesn't go away, not yet. Decomposition, abstraction, state management, blast radius containment O:-). But now you can express much more of that in the language the arxiv papers are already written in.
by Kim_Bruning
5/11/2026 at 6:53:05 PM
> But seriously, you can put a shebang on an English text file now (if you're sufficiently brave)

That inspired me to figure out how to do exactly that:
https://til.simonwillison.net/llms/llm-shebang
    #!/usr/bin/env -S llm -f
    Generate an SVG of a pelican riding a bicycle
Thanks for the inspiration!
by simonw
5/12/2026 at 7:40:59 PM
A .llm file extension might be in order :)
by wfurney
5/11/2026 at 10:45:40 PM
Oh, that looks pretty clean!
by Kim_Bruning
5/11/2026 at 9:34:00 PM
> But seriously, you can put a shebang on an English text file now (if you're sufficiently brave), or feed it through something that spits out code on the other end (so you can proofread the consequences before executing them).

The funky thing is that it's not just English. I could vibe code in Romanian, it would probably be hilarious :-)) Probably not for whoever would have to take over the app, though.
by oblio
5/12/2026 at 6:51:50 PM
Eh? You should go for it! Do everything at least once, right? Pick some simple pet project, and get it off your bucket list!

If it wasn't on your bucket list to begin with, who cares: now you can add it and complete it in one fell swoop ;-)
by Kim_Bruning
5/10/2026 at 7:45:01 PM
Scissors tend to be reasonably straightforward, so there your analogy seems to hold, but upgrading to chopsticks, needles, and chainsaws tends to attract increasing amounts of "you're doing it wrong" in increasingly alarmed tones of voice.
by Kim_Bruning
5/9/2026 at 9:04:35 PM
Exactly - I am a lawyer, and we are told to use dedicated AI products as much and however we want. There will be errors made.
by Sprotch
5/9/2026 at 11:07:54 PM
Much to the often-reported chagrin of judges across the country.
by rockskon
5/9/2026 at 7:46:51 PM
Only sort of related, but I would love to see a harness with ed as the primary file editing/reading tool. Half the bash Claude runs seems to be sed anyway; having some state persist in ed would seem to help.

What does one do when a full editor consumes too much bandwidth^H tokens? Use ed, the standard editor!
by kristjansson
5/10/2026 at 2:22:02 PM
I'm not sure you understand how those terminal programs are rendered - but the amount of control code data sent to Claude would be way, way more than using command line sed.
by nullsanity
5/11/2026 at 4:19:58 PM
You may want to take a look at `man ed`, `info ed`, or [0]. ed is many things, but verbose is not one of them.

In particular, it's designed for the teletype era, when (a) the user would have a trace of all the commands they'd sent and output they'd received, since it was literally printed on paper, and (b) output was literally printed on paper, and so had a direct, non-negligible cost.
This is more or less exactly the situation LLMs find themselves in: they can attend to ~all the prior output in their context window, but there's a direct cost to adding new symbols to context.
We've got a tool for exactly that setting, so it would be fun to try it!
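To make that concrete, here's a hedged sketch (mine, not an existing harness) of what an ed-backed edit tool could look like: the model emits a short ed script, and only those commands plus whatever the script explicitly prints would ever enter its context:

    import subprocess

    # Hypothetical sketch of an ed-backed edit tool (not an existing
    # harness). The file body stays on disk; the model's context only
    # sees the commands plus whatever the script explicitly prints.
    def run_ed(path, script):
        # "-s" suppresses ed's byte-count chatter, teletype style.
        result = subprocess.run(["ed", "-s", path], input=script,
                                text=True, capture_output=True)
        return result.stdout

    # Print line 42, fix a typo on it, then write and quit.
    print(run_ed("notes.txt", "42p\n42s/teh/the/\nw\nq\n"))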
by kristjansson
5/10/2026 at 7:01:36 AM
It's worth noting that Claude Code itself doesn't use the `insert` tool. (It also uses a custom edit tool, not the suite's predefined str_replace.)

Also, as a person who has been developing agentic code tools since before Claude Code, I'm skeptical that str_replace provides an accuracy improvement over just a full rewrite.
Back in the day, when SOTA models would do lazy coding like `// ... rest of the code ...`, full rewrite wasn't easy. Search/replace was fast, efficient, and without the lazy coding. However, it came with a slight accuracy drop.
Today that accuracy drop might be minimal or absent, but I'm not sure whether it could lead to improvements like preventing doc corruption.
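For context, the core of a str_replace-style tool is tiny; roughly something like this sketch (mine, not Anthropic's actual implementation):

    def str_replace(path, old, new):
        # Sketch of a str_replace-style edit tool: the model supplies a
        # snippet plus its replacement, so the rest of the file never
        # passes through the model's output where it could be corrupted.
        text = open(path).read()
        n = text.count(old)
        if n != 1:
            # Refusing ambiguous or missing matches is the accuracy guard.
            raise ValueError(f"expected exactly one match, found {n}")
        open(path, "w").write(text.replace(old, new, 1))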
by pcwelder
5/10/2026 at 8:15:41 AM
I've tested this extensively in a workflow (not agentic) context, and you're right: the underlying models are good both at full rewrites of code files and at doing search/replace.

They've been decent at full rewrites for 2 years. I don't think they were good at search/replace until a year ago, but I'm not so sure.
It's true that the models 2 years ago would sometimes make errors in whole rewrites - e.g. removing comments was fairly common. But I've never seen one randomly remove one character or anything like that. These days they're really good.
The main reason agentic harnesses use search/replace is surely speed and cost! Whole-file output is expensive for small changes.
by frabcus
5/10/2026 at 9:11:59 AM
I think your argument makes sense, but my understanding is that adding the document to the context and spitting it back is prone to corruption in any scenario.

I think this is closely related to other sources saying that even with a huge context window, the attention mechanism itself doesn't faithfully back-reference, so any tasks involving bigger contexts are prone to errors.
Because I have some preconception of this, maybe I am assuming that's what they were saying. Am I missing something?
by motbus3
5/9/2026 at 2:44:54 PM
People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity. I refer to HN specifically.

The fact of the matter is, if you want to edit a document by reading the document and then regurgitating the entire document with said edits... a human will DO worse then a 25% degradation. It's possible for a human to achieve 0% degradation but the human will have to ingest the document hundreds of times to achieve a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM you can get parity with the memorized human edit in this case.
But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.
by threethirtytwo
5/9/2026 at 6:49:38 PM
> People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity.

OR it could be because their concerns are genuine but are ignored in favour of a good-sounding story.
by shahbaby
5/9/2026 at 8:52:23 PM
But no one in this thread addressed the inaccuracy of the experiment. The experiment did not test the actuality of HOW LLMs are used in reality.

So that is definitively a biased interpretation. This is independent of how accurate my POV or your POV is on whether LLMs degrade documents. I am simply saying the experiment conducted is COMPLETELY DIFFERENT from how LLMs AND humans edit papers.
by threethirtytwo
5/9/2026 at 7:15:30 PM
[dead]
by redsocksfan45
5/9/2026 at 11:44:15 PM
> a human will DO worse then a 25% degradation

* than
by ActionHank
5/10/2026 at 12:49:34 AM
See, that's an example of degradation by a human. Not even an LLM will make that kinda mistake.
by threethirtytwo
5/9/2026 at 3:53:13 PM
[flagged]
by ieieue
5/9/2026 at 8:07:42 PM
> a human will DO worse then a 25% degradation

As I was reading this article, a similar thought occurred to me: "I wonder if that's better or worse than a human?" Unfortunately, there was no human baseline in this study. That said, there are studies that compare LLM to human performance. Usually, humans perform much better (like 5-7x better) at long-running tasks.
In other words, a human would probably do better than an LLM on this task.
Humans lose to LLMs in narrow, well-specified text/symbolic reasoning tasks where the model can exploit breadth, speed, and search. Usually, the LLM performed ~15% better than humans, but I saw studies that were as high as 80%. To my surprise, these studies were usually about "soft skills" like creativity and persuasion.
by tieTYT
5/9/2026 at 8:54:30 PM
You can do a baseline study right now: read this entire thread and make an edit changing every E to an I.

Show your edit by regurgitating this entire thread by hand on paper. Don't use any additional tools like find-and-replace.
Boom there's your baseline. I can simulate the result in my head.
Guys, I'm basically saying the experiment is inaccurate to the practical reality of how LLMs are actually used.
by threethirtytwo
5/11/2026 at 5:08:43 PM
> The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite [...]

Meanwhile, Claude Code extracting a part of a function, moving it to a different file, etc. will corrupt your source code, just like the paper says. This is most noticeable as comments disappearing.
We need tooling with copy-paste/cut-paste style functionality, to avoid the LLM round-trip.
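Something like this hypothetical sketch of the idea (mine, not an existing API):

    def cut_paste(src, dst, start, end, at):
        # Hypothetical cut-paste tool: move lines start..end (1-based,
        # inclusive) from src into dst before dst line `at`, so the moved
        # text never enters the model's output to be paraphrased.
        src_lines = open(src).readlines()
        block = src_lines[start - 1:end]
        remaining = src_lines[:start - 1] + src_lines[end:]
        dst_lines = open(dst).readlines()
        dst_lines[at - 1:at - 1] = block
        open(src, "w").writelines(remaining)
        open(dst, "w").writelines(dst_lines)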
by yencabulator
5/10/2026 at 4:06:12 AM
Any rando can publish research nowadays. It means nothing, just like "X country published N research papers last year". It is noise. In a world where it was required to attach age, experience level, and country of origin to every comment, research paper, or post on the internet, it would shatter the conviction we mistakenly have towards the information we receive.

This team is inexperienced and it shows.
The noise-to-signal ratio will get worse, even in "academia". Brace yourselves. The kids are growing up in this new world.
by Art9681
5/9/2026 at 11:48:55 PM
It could also be that, much like most large orgs now, you've made LLMs your entire personality, so you don't see the inherent bias.

Most LLM users who are not touching code are certainly not going to be using a harness. They're going to take all the documents, slam all those tokens into the context window, see they have only used 500k out of their 1M tokens, and say "summarize".
by ActionHank
5/10/2026 at 1:04:28 AM
Wouldn't they be more likely to give ChatGPT access to a Google Drive folder or some such? The tools the agent has for editing documents will be whatever the app they used implemented.
by skybrian
5/9/2026 at 11:59:46 PM
Yeah, this is a bit of a strawman of an LLM task.

On editing tasks, one should only allow programmatic editing commands; the text shouldn't flow through the LLM at all. The LLM should analyze the text and emit commands to achieve a feedback-directed goal.
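A minimal sketch of what "emit commands" could mean in practice (the command schema here is hypothetical, not from the paper):

    import json

    # Minimal sketch of "emit commands, not text": only this code touches
    # the document, so the model cannot paraphrase or drop passages.
    def apply_edits(path, commands_json):
        lines = open(path).read().splitlines()
        # Apply bottom-up so earlier line numbers stay valid.
        for cmd in sorted(json.loads(commands_json), key=lambda c: -c["line"]):
            if cmd["op"] == "replace":
                lines[cmd["line"] - 1] = cmd["text"]
            elif cmd["op"] == "delete":
                del lines[cmd["line"] - 1]
        open(path, "w").write("\n".join(lines) + "\n")

    # The model's entire output would be a command list like this:
    apply_edits("draft.txt", '[{"op": "replace", "line": 3, "text": "Revised heading"}]')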
by genxy
5/10/2026 at 11:20:54 AM
[flagged]
by ultrathink-er
5/9/2026 at 7:26:29 PM
[flagged]
by rs545837
5/9/2026 at 8:21:35 PM
[dead]
by javajive
5/9/2026 at 6:00:57 PM
The incomprehensible methodology, whether due to resource constraints or straight up for simplicity's sake, makes these papers worthless, unfortunately.
by alansaber