LLMs corrupt your documents when you delegate

5/9/2026 at 2:26:59 PM

I'm suspicious of their results with regards to tool usage.

It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.

They claim that tool use didn't help, which surprised me... but they also said:

> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.

And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!

The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...

The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.

They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.

Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...

The relevant prompt fragment is:

  You can approach the task in whatever
  way you find most effective:
  programmatically or directly
  by writing files

As with so many papers like this, the results of the paper reflect more on the design of the harness that the paper's authors used than on the models themselves.

I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.

by simonw

5/9/2026 at 7:26:21 PM

I agree with most of what you wrote except for this:

>Frequent LLM users already know not to do that.

And I think that’s the biggest problem. Amidst the current push to utilize LLMs across orgs and groups there are a large (if even say majority) of people that are using them every day but who have never approached anything as technical as a “harness” before let alone an entire setup.

For them the behavior mentioned here is a major issue.

by ofjcihen

5/10/2026 at 3:10:11 AM

Exactly. When I use a scissor, I don't want the scissor to not work just because I'm not a "frequent scissor user," and then get told by someone who makes their breakfast with scissors that I'm doing it wrong. Most people will not be "frequent" anything users.

by AlienRobot

5/10/2026 at 8:42:31 AM

Most people also understand that, because they're not "frequent" users of a thing, they absolutely suck at using it, and set their expectations accordingly. In particular, they realize that doing anything non-trivial with the thing requires them to spend some learning and practice time, or asking/hiring a "frequent" user to do it for them.

So the reasonable response to being told you're holding your scissors wrong is to realize that yes, you most likely are holding your scissors wrong[0], and ask the other person for advice (or just to do the thing), or look up a YouTube video and learn, or sign up to a class, or such.

Expecting mastery in 30 seconds is not a reasonable attitude, but it's unfortunately the lie that software industry tried to sell to people for the past 15 years or so.

[0] - There's much more to it than one would think.

by TeMPOraL

5/10/2026 at 9:39:01 AM

I’m interested in the “non-trivial” point as well, this seems to be a common refrain from the anti-LLM tech crowd, “LLMs aren’t good at doing anything non-trivial”, well is that really the case or is it just harder and one needs to put in more practice for more complicated tasks?

I don’t have an example off hand, but I know that it’s easy to dismiss something an LLM does as trivial if your work is extremely marginal. Most devs aren’t creating their own programming languages. I can’t help but think people who hold this opinion also think the work most software professionals do is “trivial” (“you’re just moving strings around, that’s not impressive/trivial”)

by wfurney

5/10/2026 at 12:42:24 PM

On the one hand there's simonw's concept of a "frequent LLM user" and then there's the actual vast majority of people using ChatGPT web app or one of the various Office CoPilots.

by fluidcruft

5/11/2026 at 3:12:30 PM

I should have said "frequent, expert LLM user".

by simonw

5/10/2026 at 9:20:51 AM

If you make the example any more complicated, it makes sense.

A lathe operator isn’t any good if they don’t frequently operate lathes.

A articulated robot implementer needs frequent experience implementing robots to be any good.

That doesn’t mean lathes or robots are useless. Nor does it mean they have failed as products because they require expertise.

I do think it raises questions as to whether vast swathes of the population will be effective at using LLMs. Are they scissors, or a lathe?

by mediaman

5/10/2026 at 9:34:05 AM

Everybody seems to want them to be scissors, or at least to treat them as such, but even still the reason everyone can use scissors so well is because they’ve practiced with them, right? You’re probably a lot better at using scissors now than the first time you did it, the functionality is just so simple it’s harder to notice.

To me learning to use LLMs is the same as doing anything else, you have to practice and put in the hours to get good. Maybe some harnesses will eventually allow LLMs to function more as scissors than lathes. This seems to be what Microsoft is trying to do by embedding Copilot in all their products and saying “choose the UI that works best for you”. If that doesn’t end up working we’ll need another paradigm for “non-technical” users to effectively operate computer assistants

by wfurney

5/10/2026 at 8:17:15 PM

I figure english is the next coolest programming language for scripting and compilation. So far people have been writing fun little demos with it, but now people are starting to place real demands on it, and you're starting to see actual programs needing to be built. Unsurprisingly this requires a bit more craft.

by Kim_Bruning

5/10/2026 at 9:51:51 PM

> I figure english is the next coolest programming language for scripting and compilation. So far people have been writing fun little demos with it, but now people are starting to place real demands on it, and you're starting to see actual programs needing to be built. Unsurprisingly this requires a bit more craft.

Perhaps that craft of using the exact subset of English has something to do with the correct selection of words and concise, yet expressive enough, expressions, in a fashion resembling creating a code.

A code that's meant to be understood by machines, we could call it "computer code". And said computer code could be used to create recipes, algorithms, let's call them "programs". Hey, I think I have ideas for 2 possible names for this process!

by oblio

5/11/2026 at 2:56:01 AM

No wait, you think I'm being silly so that's why you're being a bit sarcastic back.

But seriously, you can put a shebang on an english text file now (if you're sufficiently brave), or feed it through something that spits out code on the other end (so you can proof read the consequences before executing them).

It's crazy, but this is 2026, and that actually ... just works. You can even do it locally, if you don't mind running a space heater.

Thing is, when you have the expressiveness and power of a full natural language (and you're already paying for it), why would you want to constrain yourself to a subset? That's not very practical. Why not use all of it? Computing was never about typing code into machines anyway. "Computer" used to be a human profession, until it got automated.

On the upside, there's thousands of years of documentation. On the downside, a lot said documentation is underspecified and/or straight wishful thinking. It's certainly an interesting avenue to explore.

by Kim_Bruning

5/11/2026 at 7:29:50 AM

> why would you want to constrain yourself to a subset? That's not very practical. Why not use all of it?

For the same reason math, physics, chemistry, etc figured out a long time ago that Koine Greek, Latin, French, German, English, etc aren't the best languages for science. Constraint gives focus, precision.

If you code novels, knock yourself out.

by oblio

5/11/2026 at 12:30:42 PM

Let's actually look at this as if I'm serious for a second? Tell me this framing really can't work.

None-exhaustively:

Python has if, for, while, def, class and first class lists, dicts , functions ;

Forth has this stack machine concept, RPN, compilation-in-the-REPL when defining new functions.

Lisp has this code is data is code concept, and CAR, CDR, first class lists (obviously ), first class functions (in some of them) ... etc.

Machine code can (theoretically) be directly expressed in logic gates.

How about a quick look at what English supports:

Conditionals, iteration, abstraction, composition, delegation, exception handling, scope, naming, modularity; intent, priority, graduated precision, analogy, context-dependence; And.. the concept of semantic triples is built in as a syntactic primitive (subject-predicate-object), so you can even do a bunch of GOFAI right off the bat.

It's weird thinking of english as a programming language. But it kind of works like one if you want to, and computers can process it now?

by Kim_Bruning

5/11/2026 at 1:41:25 PM

I'm not saying English (or any other natural language) is not usable. It is, since it's a more complex language than programming language. All natural languages are supersets of current programming languages.

I'm talking about the opposite problem: these supersets are ambiguous, contradictory, vague. At the end of the day the thing that is programmed needs to be clear, unambiguous and ideally concise, too (performance in its million incarnations).

So yeah, I guess you can fix the ambiguous aspect with verbosity. Just write more words until you define everything you need to define more directly when using a formal language.

I would be extremely shocked if programming wouldn't require knowing a very specific, albeit huge, domain jargon.

by oblio

5/11/2026 at 2:23:43 PM

Question isn't per-se "is English a great language to write the next sorting algorithm in?" . Probably not. Rust is quicker , and cheaper to execute besides. But there's entire classes of problem that English might be more useful for.

English assumes the target is an agent with memory/state in a given context. Ambiguity, verbosity, noise is strongly reduced by means of modelling the other agent's state, then only transmitting the required state diff. The receiver decodes by comparing the diff against the other side's predicted state and updating. [1] This kind of protocol would obviously be NUTS to build from scratch if you went about it as an engineer I'd think. But we have the hardware and software preinstalled in humans , and now my 3090 can run an (imperfect, but viable) decoder

Is it useful? Yeah, I think it actually is. English is able to encode things that are ambiguous, contradictory, vague... and get useful results. Not always; maybe not even often. As you say, skill required, but the option is there. Formal languages just crash.

It's interesting is what I'd call it.

[1] see also: Clark & Brennan's grounding theory in linguistics; Predictive coding in neuroscience; Delta encoding in compression; and Theory of mind in cognitive science. They all dance around the same shape, so this is roughly accurate I think.

by Kim_Bruning

5/11/2026 at 11:50:16 AM

In a semi-random sample of 10 recent articles on arxiv.org, 10 articles (100%) contained english language as the predominant part of the corpus. Where necessary mathematical notation was included.

So - you're not wrong that eg. mathematical notation is (often) used, as we both very well know. But English is really quite prominent!

And now computers can process both, where before they couldn't.

The engineering doesn't go away, not yet. Decomposition, abstraction, state management, blast radius containment O:-) . But now you can express much more of that in the language the arxiv papers are already written in.

by Kim_Bruning

5/11/2026 at 6:53:05 PM

> But seriously, you can put a shebang on an english text file now (if you're sufficiently brave)

That inspired me to figure out how to do exactly that:

https://til.simonwillison.net/llms/llm-shebang

  #!/usr/bin/env -S llm -f
  Generate an SVG of a pelican riding a bicycle

Thanks for the inspiration!

by simonw

5/12/2026 at 7:40:59 PM

A .llm file extension might be in order :)

by wfurney

5/11/2026 at 10:45:40 PM

Oh, that looks pretty clean!

by Kim_Bruning

5/11/2026 at 9:34:00 PM

> But seriously, you can put a shebang on an english text file now (if you're sufficiently brave), or feed it through something that spits out code on the other end (so you can proof read the consequences before executing them).

The funky thing is that it's not just English. I could vibe code in Romanian, it would probably be hilarious :-)) Probably not for whomever would have to take over the app, though.

by oblio

5/12/2026 at 6:51:50 PM

Eh? You should go for it! Do everything at least once, right? Pick some simple pet project, and get it off your bucket list!

If it wasn't on your bucket list to begin with, who cares: Now you can add it and complete it in one fell swoop ;-)

by Kim_Bruning

5/10/2026 at 7:45:01 PM

Scissors tend to be reasonably straightforward, so there your analogy seems to hold, but upgrading to chopsticks, needles, and chainsaws tends to attract increasing amounts of "you're doing it wrong" in increasingly alarmed tones of voice.

by Kim_Bruning

5/9/2026 at 9:04:35 PM

Exactly - I am a lawyer and we are told to use dedicated AI products as much and however we want. There will be errors made

by Sprotch

5/9/2026 at 11:07:54 PM

Much to the often-reported chagrin of judges across the country.

by rockskon

5/9/2026 at 7:46:51 PM

Only sort of related, but I would love to see a harness with ed as the primary file editing / reading tool. Half the bash Claude runs seems to be sed anyway, having some state persist in ed would seem to help.

What does one do when a full editor consumes too much bandwidth^H tokens? Use ed, the standard editor!

by kristjansson

5/10/2026 at 2:22:02 PM

I'm not sure you understand how those terminal programs are rendered - but the amount of control code data sent to Claude would be way, way more than using command line sed.

by nullsanity

5/11/2026 at 4:19:58 PM

You may want to take a look at `man ed`, `info ed`, or [0]. ed is many things, but verbose is not one of them.

In particular, it's designed for the teletype era, when (a) the user would have a trace of all the commands they'd sent and output they'd received, since it was literally printed on paper, and (b) output was literally printed on paper, and so had a direct, non-negligible cost.

This is more or less exactly the situation LLMs find themselves in. they can attend to ~all the prior output in their context window, but there's a direct cost to adding new symbols to context.

We've got a tool for exactly that setting, so it would be fun to try it!

[0]: https://en.wikipedia.org/wiki/Ed_(text_editor)

by kristjansson

5/10/2026 at 7:01:36 AM

It's worth noting that Claude Code itself doesn't use the `insert` tool. (It also uses custom edit tool not the suite's predefined str_replace)

Also as a person developing agentic code tools since before Claude Code, I'm skeptical if str_replace provides accuracy improvement over just full rewrite.

Back in the day when SOTA models would do lazy coding like `// ... rest of the code ...`, full rewrite wasn't easy. Search/replace was fast, efficient and without the lazy coding. However, it came with slight accuracy drop.

Today that accuracy drop might be minimal/absent, but I'm not sure if it could lead to improvements like preventing doc corruption.

by pcwelder

5/10/2026 at 8:15:41 AM

I've tested this extensively in a workflow (not agentic) context, and you're right, the underlying models are both good at full rewrite of code files, and at doing search/replace.

They've been decent at full rewrite for 2 years. I don't think they were good at search/replace until a year ago, but I'm not so sure.

It's true that the models 2 years ago would sometimes make errors in whole rewrite - e.g removing comments was fairly common. But I've never seen one randomly remove one character or anything like that. These days they're really good.

Main reason agentic harnesses use search/replace is speed and cost, surely! Whole file output is expensive for small changes.

by frabcus

5/10/2026 at 9:11:59 AM

I think your argument makes sense but my understanding is that adding the document to the context and spitting it back is prone to corruption in any scenario.

I think this is closely related to other sources saying that even if you have huge context the attention mechanism itself is not back referencing thus any tasks related to bigger contexts are prone to errors.

because I have some preconception of this maybe I am assuming this is what they were saying. Am I missing something ?

by motbus3

5/9/2026 at 2:44:54 PM

People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity. I refer to HN specifically.

The fact of the matter is, if you want to edit a document by reading the document and then regurgitating the entire document with said edits... a human will DO worse then a 25% degradation. It's possible for a human to achieve 0% degradation but the human will have to ingest the document hundreds of times to achieve a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM you can get parity with the memorized human edit in this case.

But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.

by threethirtytwo

5/9/2026 at 6:49:38 PM

> People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity.

OR it could be because their concerns are genuine but are ignored in favour of a good sounding story.

by shahbaby

5/9/2026 at 8:52:23 PM

But no one in this thread addressed the inaccuracy of the experiment. The experiment did not test the actuality of HOW LLMs are used in reality.

So that is definitively a biased interpretation. This is independent of how accurate my POV or your POV is on whether LLMs degrade documents. I am simply saying the experiment conducted is COMPLETELY DIFFERENT from how LLMs AND humans edit papers.

by threethirtytwo

5/9/2026 at 7:15:30 PM

[dead]

by redsocksfan45

5/9/2026 at 11:44:15 PM

> a human will DO worse then a 25% degradation.

* than

by ActionHank

5/10/2026 at 12:49:34 AM

See that’s an example of degradation by a human. Not even an LLM wil make that kinda mistake.

by threethirtytwo

5/9/2026 at 3:53:13 PM

[flagged]

by ieieue

5/9/2026 at 8:07:42 PM

> a human will DO worse then a 25% degradation

As I was reading this article, a similar thought occurred to me: "I wonder if that's better or worse than a human?" Unfortunately, there was no human baseline in this study. That said, there are studies that compare LLM to human performance. Usually, humans perform much better (like 5-7x better) at long-running tasks.

In other words, a human would probably do better than an LLM on this task.

Humans lose to LLMs in narrow, well-specified text/symbolic reasoning tasks where the model can exploit breadth, speed, and search. Usually, the LLM performed ~15% better than humans, but I saw studies that were as high as 80%. To my surprise, these studies were usually about "soft skills" like creativity and persuasion.

by tieTYT

5/9/2026 at 8:54:30 PM

You can do a baseline study right now. Read this entire thread and make an edit of changing every E to an I.

Show your edit by regurgitating this entire thread by hand on a paper. Don't use any additional tools like Find and replace.

Boom there's your baseline. I can simulate the result in my head.

Guys I'm basically saying the experiment is innaccurate to the practical reality of how LLMs are actually used.

by threethirtytwo

5/11/2026 at 5:08:43 PM

> The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite [...]

Meanwhile, Claude Code extracting a part of a function, moving it to a different file, etc will corrupt your source code, just like the paper says. This is most noticeable as comments disappearing.

We need tooling with copy-paste/cut-paste style functionality, to avoid the LLM round-trip.

by yencabulator

5/10/2026 at 4:06:12 AM

Any rando can publish research nowadays. It means nothing. Just like "X country published N research papers last year". It is noise. In a world where it was required to attach age, experience level, and country of origin to every comment, research paper, or post on the internet, it would shatter the conviction we mistakenly have towards the information we receive.

This team is inexperienced and it shows.

The noise to signal ratio will get worse, even in "academia". Brace yourselves. The kids are growing up in this new world.

by Art9681

5/9/2026 at 11:48:55 PM

It could also be that much like most large orgs now you've made LLMs your entire personality, so you don't see the inherent bias.

Most LLM users who are not touching code are certainly not going to be using a harness. They're going to take all the documents, slam all those tokens into the context window, see they have only used 500k out of their 1M tokens and say "summarize".

by ActionHank

5/10/2026 at 1:04:28 AM

Wouldn't they be more likely to give ChatGPT access to a Google Drive folder or some such? The tools the agent has for editing documents will be whatever the app they used implemented.

by skybrian

5/9/2026 at 11:59:46 PM

Yeah, this is a bit of a strawman of an LLM task.

On editing tasks, one should only allow programmatic editing commands, the text shouldn't flow through the LLM at all. The LLM should analyze the text and emit commands to achieve a feedback directed goal.

by genxy

5/10/2026 at 11:20:54 AM

[flagged]

by ultrathink-er

5/9/2026 at 7:26:29 PM

[flagged]

by rs545837

5/9/2026 at 8:21:35 PM

[dead]

by javajive

5/9/2026 at 6:00:57 PM

The incomprehensible methodology due to resource constraints or straight up for simplicity's sake make these papers worthless unfortunately

by alansaber

5/9/2026 at 12:43:10 PM

Yeah I've been saying this for a while: AI-washing any text will degrade it, compounding with each pass.

"Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...

by causal

5/9/2026 at 1:25:42 PM

I've been calling it meanwit reversion

by mohamedkoubaa

5/10/2026 at 5:03:43 PM

code too.

Someone AI'd and "extended" some code I wrote, and it did what day-1 jr programmer did - moved all the whitespace lines around and destroyed readability.

Like "cleaning up" the mona lisa :)

by m463

5/9/2026 at 1:15:50 PM

By „with each pass” do you mean within the same session, or with new session (context window) each time?

by polskibus

5/9/2026 at 1:51:59 PM

In my experience, it happens with each edit of the document, whether or not you clear the context window.

You can somewhat mitigate this, at the same moment you ask for the new edit, by adding new info or specifying the lost meaning you want to add back. But other things will still get washed out.

Nuances will drift, sharp corners will be ablated. You're doing a Xerox copy of your latest Xerox copy, so even if you add your comments with a sharpie, anything that was there right before will be slightly blurrier in the next version.

by sebastiennight

5/9/2026 at 9:19:00 PM

Which is why I think AI assisted writing is better then just letting it write the full text (if you care about the quality of the result). The act of writing isn't just the production of text, it is about wrangling a topic, rotating it in your mind and finding the perfect expression for a thought you have and that you want to convey to others. Some of those things can't be known by the LLM since you don't know them yourself by the point you started out.

Often that thinking bit itself provides value to the person doing it, beyond the text itself. By letting a LLM do it for you, you rob yourself of the change of thought and the new findings you may encounter.

Working with LLMs just makes it quicker to get going, bit you need to be a ruthless editor.

by atoav

5/9/2026 at 1:57:50 PM

Each edit, even with unrelated edits. I had a README referring to something as "the cathedral of s*t" (some HN commentators don't care for the swearing, which is systemically bad news but w/e) and the robot would lift that phrase out in drive-bys, repeatedly.

Occasionally it would report the action, sometimes it would not bother to report it. It never reached into the README on an unrelated doc edit, but if it was touching the README, that line was getting excised.

by adampunk

5/10/2026 at 5:14:53 AM

That kind of passive-aggressive pseudo-moralizing is a common feature of all the current 'frontier' models. Try to do something like get one to summarize A Song of Ice and Fire text and it's likely to try and covertly sand off all the 'offensive' rough edges without even saying it's doing so.

by crooked-v

5/9/2026 at 9:14:38 PM

It is X, not Y!

by atoav

5/9/2026 at 2:13:14 PM

Least shocking thing I've read about LLMs recently.

They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until by the end its unrecognizable.

Except with LLMs, the starting point is intent. Each pass of the LLMs degrades the intent, like in the case of a precise scientific paper, just a little bit of nuance, a little bit of precision is lost with a re-wording here and there.

LLMs are mean reversion machines, the more 'outside of their training' the context/work load they are currently dealing with, the more they will tend to gradually pull that into some homogenous abstract equilibrium

by timacles

5/9/2026 at 5:18:29 PM

I've definitely experienced this while coding with LLMs. Often, after a flurry of feature work in which I thought I was being reasonably careful but moving very fast, I take a closer look at some small piece of code and go "holy shit". Then I have to spend a few hours going over everything and carefully reworking parts where things didn't quite go how I'd like, where I was unclear, or where the LLM's brainworms kicked in.

Quality is really important to me in its own right, but I also worry about this exact "repeated compression" problem: when my codebase is clean and I have an up-to-date mental model, an LLM can quickly help me churn out some feature work and still leave the codebase in a reasonable state. But as the LLM dirties up the codebase, its past mistakes or misunderstandings compound, and it's likely to flub more and more things. So I have to go back and "restore" things to a correct state before I feel comfortable using the LLM again.

by isityettime

5/9/2026 at 6:25:34 PM

This seems closely related to the problem of model collapse [1][2][3], where LLMs lose the tails of the distribution, and so when you recursively train on the output of an LLM, or otherwise feed the output back into the input in subsequent stages, you lose the precision and diversity that human authors bring to the work. Eventually everything regresses to the mean and anything that would've made the content unique, useful, and differentiated gets lost.

My takeaway from this is that AI is a temporary phenomena, the end stage of the Internet age. It's going to destroy the Internet as we know it as well as much of the technological knowledge of the developed world, and then we're going to have to start fresh and rebuild everything we know. My takeaway is that I'm trying to use AI to identify and download the remaining sources of facts on the Internet, the human-authored stuff that isn't generated for engagement but comes from the era when people were just putting useful stuff online to share information.

[1] https://en.wikipedia.org/wiki/Model_collapse

[2] https://www.nature.com/articles/s41586-024-07566-y

[3] https://cacm.acm.org/blogcacm/model-collapse-is-already-happ...

by nostrademons

5/9/2026 at 8:26:07 PM

Yep humans and civilization are subject to the same model-collapse phenomenon as they interact more with LLMs, but engineering knowledge has always been held by a small minority with certain personality characteristics. Maybe the minority will get smaller but I'm not sure it will completely disappear. There's always people like yourself building archives.

by davebren

5/10/2026 at 7:50:19 AM

See A Canticle for Leibowitz

by lukebuehler

5/10/2026 at 3:59:03 PM

Ah that looks very interesting I'll check it out, thanks.

by davebren

5/9/2026 at 6:43:16 PM

There are plenty of AIs that are immune to this because they're trained on something that won't be flooded with slop. E.g. robotics, self-driving cars (both trained on real camera/sensor inputs) or programming/proof-assistant stuff (trained on things that are verifiable).

by the8472

5/9/2026 at 5:35:42 PM

My experience mostly matches this: I think of a piece of development work having three phases:

1. Prototype 2. Initial production implementation 3. Hardening

My experience with LLMs is that they solve “writer’s block” problems in the prototyping phase at the expense of making phases 2+3 slower because the system is less in your head. They also have a mixed effect on ongoing maintenance: small tasks are easier but you lose some of the feel of the system.

by fiddlerwoaroof

5/9/2026 at 6:14:31 PM

I completely agree with all of these observations.

And indeed for me, the biggest productivity boost has nothing to do with my "typing speed" or any such nonsense, it's that it can help with writer's block and other kinds of unhelpful inertia.

It kind of reminds me of ADHD medication: it alleviates the "inability to direct attention at one thing" problem, but actually exacerbates the "time blindness" and "hyperfocus" problems.

I think probably a lot of complex tools have these characteristics: useful in some ways, liable to backfire in others, and ultimately context-sensitive (and maybe somewhat unpredictable) in their helpfulness.

Hopefully as LLMs are more widely experimented with by developers, the conversation can continue to move away from thinking about the effects of LLM use in terms of some uniform/fungible "productivity" and towards understanding where it hurts and where it helps, how to tell when it's time to put it away, what kinds of codebases are really hurt by that kind of detached engagement, and what kinds of projects leverage that sort of rapid prototyping the most effectively.

Plausible text generation is an almost magical trick, whether it's generating human language or computer code. But it turns out it's not a silver bullet, no matter how impressive the trick is. It's more interesting than a silver bullet, in fact: it's a system of surprising tradeoffs, even for different phases of the same overall task.

by isityettime

5/9/2026 at 6:34:58 PM

Usually you'll iterate several times on #1, which is where LLMs are really helpful. They let you get working code from stage #1 quite quickly, so you can check the output and behavior, and then oftentimes you'll find that you framed the problem incorrectly in the first place. Then you can fix your problem definition, have the LLM rewrite the code, try it again, and so on, until you get the results you want.

#1 -> #2 is a gap, but it also helps if you ask the LLM to explain its thinking and generate a human-readable design-doc of the approach it took and code organization it used. Then you read the design doc to gain the context, and pick up with #2.

by nostrademons

5/9/2026 at 5:23:44 PM

Yeah, a lot of "it doesn't matter how the code looks" convos seem to be ignoring that we know what happens over time when you just make tactical the-tests-still-pass changes over and over and over again. Slowly some of those tests get corrupted without noticing. And you never had the ENTIRE spec (and all the edge-case but user-relied-on-things) covered anyway. And then new dev gets way harder.

by majormajor

5/9/2026 at 5:49:47 PM

This is definitely most annoying when dealing with software or standards with slightly illogical or hard to grasp cases. Recently, I worked on one of the software community's favourite spaces, timezones, and kept getting myself and my LLM context polluted with the confusion that arises when using POSIX standard timezone notation and common human-readable formats.

This blog probably covers my exact headache [0]. In summary, "Etc/GMT+6" actually means UTC-6. I was developing a one-off helper script to massively create calendars to a web app via API, and when trying to validate my CSV+Python script's results, I kept getting confused as to when do the CSV rows have correct data and when does the web app UI have correct data. LLM probably developed the Python script in a manner that translated this on-the-fly, but my human-readable "Calendar name" column which had "Etc/GMT+6" would generate a -6 in the web app. This probably would not have been a problem with explicit locations specified, but my use case would not allow for that.

When trying to debug if something is wrong, the thinking trace was going into loops trying to figure out if the "problem" is coming from my directions, the code's bugs, or the CSV having incorrect data.

Learning: when facing problems like this, try using the well-known "notepad file" methods to track problems like this, so that if the over-eager LLM starts applying quick code fixes – although YOU were the "problem's" source – it will be easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated due to more resilient improvements" vs. "code generated during debugging that sort of changed some specific step of the script".

(Do note that I am not an advanced software engineer, my practices are probably obvious to others. My repos are mainly comprised of sysadmin style shell/python helper code! :-) )

[0]https://blacksheepcode.com/posts/til_etc_timezone_is_backwar...

by originalvichy

5/9/2026 at 6:21:42 PM

> when facing problems like this, try using the well-known "notepad file" methods to track problems like this, so that if the over-eager LLM starts applying quick code fixes – although YOU were the "problem's" source – it will be easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated due to more resilient improvements" vs. "code generated during debugging that sort of changed some specific step of the script".

Yeah, I have definitely hit this as well. Sometimes I've named a function or variable in a way that misuses a term or concept, or I've changed what something does without fully thinking it through. The LLM sees that code, notices an inconsistency, and makes a guess about what I meant. But because I screwed up, only I know what I really meant (or what I "should have meant"). So the LLM ends up writing a fix that breaks assumptions made in other parts of the code— assumptions that fit with my overall original mental picture, but not the misnomer the LLM got snagged on. Or it writes a small-scoped fix but the mistake of mine it stumbled upon actually merits rethinking and redesigning how some parts interact, so even if its fix is better than what I had before, I want to unwind that change so I can redefine my interfaces or whatever.

That's definitely worth calling out: it's not only the LLM's mistakes that make it more likely to commit future mistakes. Any mistakes in the codebase can compound like that. If you want an LLM to do useful work for you, it's more relevant than ever to "tidy first".

by isityettime

5/9/2026 at 2:56:34 PM

Where this result is actually interesting and relevant is when a coding agent splits a large source file into multiple smaller files. Opus + Claude Code will try to recite long sections of source code from memory into each of the new files, instead of using some sort of copy/paste operation like a human would.

Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.

Ordinary editing on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.

by ekidd

5/9/2026 at 3:33:33 PM

If you’re using LLMs for agentic work it is absolutely essential that you have a robust set of tools for them to use and the correct instructions to prompt their use.

The LLM will come up with stupid ways to do things, common sense doesn’t exist for AI.

by devmor

5/9/2026 at 4:04:36 PM

Isn't this the whole reason they became viable in the last 6 months? The system prompt and harness is improving. It's less and less essential every day to roll your own.

by jvuygbbkuurx

5/9/2026 at 4:13:40 PM

I don't think there is a single reason. Models are improving, so are the harnesses, prompts and we who use them a lot also get more proficient and learn where they can be used effectively vs not, so lots of improvements all over the ecosystem, brought together.

Latest big change is probably how feasible local models are becoming, like Qwen 3.6 and Gemma 4, they're no longer easily getting stuck in loops and repetition, although on lower quantizations they still pretty much suck for agentic usage.

by embedding-shape

5/9/2026 at 4:55:23 PM

> we who use them a lot also get more proficient and learn where they can be used effectively vs not

I think it’s always been obvious where an LLM could be used effectively and where it cannot, if you understand how they work and don’t see them as magical.

The “increase in proficiency” is mostly people coming back to reality and being more intentional about LLM usage. There are no surprise discoveries here. One does not need to use an LLM a lot to get effective with them. A total noob could become effective on day 1 with proper guidance.

by deadbabe

5/9/2026 at 7:32:06 PM

I think you hit the nail on the head. I had been in this space for a little bit before it really became popular. I haven’t seen incredible gains in model competency. What I have seen though is people figuring out what works and what doesn’t.

by ofjcihen

5/10/2026 at 11:40:38 AM

It’s pretty telling that ignoring LLMs entirely for a few years and then jumping in last minute after everyone has struggled through figuring out how to use them still puts you on the same level very quickly.

by deadbabe

5/10/2026 at 12:23:51 PM

> then jumping in last minute after everyone has struggled through figuring out how to use them still puts you on the same level very quickly

Does it actually though?

I've used agents for quite some time now, if someone who never used agents before want to put this to the test somehow, I'm open to try to measure this, reach out via email :)

by embedding-shape

5/11/2026 at 1:10:26 AM

What would you even propose as a test

by deadbabe

5/12/2026 at 6:52:45 PM

Can be anything, refactor of old thing, build new thing, change existing thing, something like "Do X in Y hours/days, then we compare the results", and we'll see if there is any differences. Do it a couple of times, you might be able to figure out how to measure it as you discover the better methodologies.

by embedding-shape

5/9/2026 at 4:30:52 PM

The models also have far more intelligence built in. For example, the pi.dev agent harness has a system prompt which fits on a single page, and includes only 4 or 5 tools. Running with a small coding model like Qwen3.6 27B, this setup is completely capable of agentic coding.

by ekidd

5/9/2026 at 4:47:10 PM

They still aren't viable. Nothing changed within the last 6 months.

by bigstrat2003

5/9/2026 at 6:53:29 PM

My favorite is when Claude will build a completely new application to load and inspect a .dll file using reflection instead of just googling the library's interfaces.

by Salgat

5/10/2026 at 2:16:16 AM

It did this for during one of the recent outrage periods. It was unjarring deps left and right instead of googling for it. What an easy way for me to own the tokenmaxxing leaderboard I remember thinking

by smrtinsert

5/9/2026 at 11:00:25 PM

“Use all of the tools at your disposal, including searching the internet” is my claude-specific common instruction.

by devmor

5/9/2026 at 4:41:32 PM

> And for in-place edits, you can review "git diff" for surprises.

I don't let AI touch git anyway, and I always review the diff after it generated stuff. If it modifies my documentation, I always want to check if it messed with the text instead of just added formatting.

by ClikeX

5/9/2026 at 5:23:13 PM

This. I know the LLM agents often have their own little diff viewers and edit approval workflows, but for a high volume of code, I cannot imagine actually reviewing everything without leaning on much more capable Git tooling.

I use Magit, and up until I started using LLM agents it was mostly a nice-to-have that I relied on casually. (I was definitely under-utilizing its power.) But for reviewing, selectively staging, and selectively rejecting the changes of an LLM agent? I feel like I'd die without it. Idk how others manage.

by isityettime

5/9/2026 at 3:26:02 PM

There's a kid's game that illustrates this too: https://en.wikipedia.org/wiki/Telephone_game

by Kim_Bruning

5/9/2026 at 4:14:15 PM

Maybe more relatable to the typical HN reader: You know when the top boss tells the lower bosses stuff, who then tells the lower bosses something and once it reaches you as an IC it's all different and corrupted compared to what it initially was? LLMs have the same effect, unsurprisingly.

by embedding-shape

5/9/2026 at 2:46:44 PM

A coworker talks about LLMs as "bullshit" layers. Not exactly dismissing them or being derogatory about them, but emphasising that each time you feed something through an LLM, what comes out the other side may not be what you expect/want. Like that guy at the pub sharing what he'd seen online somewhere, after a few pints. Might be accurate, but carries notable risk it's not.

So e.g., don't use an LLM to call an API to gather data and produce a report on it, as that's feeding deterministic data through a "bullshit" layer, meaning you can't trust what comes out the other side. Instead use the LLM to help you write the code that will produce a deterministic output from deterministic data.

I've seen co-workers use LLMs to summarise deterministic data coming from APIs and have reports be wildly off the mark as often as they are accurate. Depending on what they're looking at that can have catastrophic risk.

by Twirrim

5/9/2026 at 3:06:11 PM

Similar experience. I wouldn't say it even needs to be like some random person in the local pub: this behaviour is what you'd get from any game of telephone, book authors will say how you need to be blunt and direct about points in the book because readers will miss subtlety, anyone who has been quoted in a newspaper will have a story about the paper getting it wrong, etc.

However, there's a reason pre-computing bureaucracy came with paper trails and meeting minutes getting written up, why court cases are increasingly cautious about the reliability of eye witnesses.

It is ironic, the more AI becomes like us and less it acts like a traditional computer program, the worse it is at many things we want to use it for, but because collectively we're oblivious to our cognitive limitations we race into completely avoidable failures like this.

by ben_w

5/9/2026 at 3:37:39 PM

> However, there's a reason pre-computing bureaucracy came with paper trails and meeting minutes getting written up, why court cases are increasingly cautious about the reliability of eye witnesses.

This was the comment I was coming in to make: I worked in a pre-computing bureaucracy (the U.S. Navy's) and "staff you delegated work to have consistent trouble following the directions you provide for the delegated work" is just a fact of life.

A lot of it is telephone game, a lot of it is is lack of real familiarity with office software, a lot of it is the inherent integration challenge from sending the same document out for coordination to dozens of stakeholders.

All those mistakes you made fixes for based on comments in the draft that went out for O-6 review? At least 2 will pop up again at 1-star review because staffers will copy the same text back out from their local copy they had stashed during O-6 review rather than re-reviewing from scratch.

Style guidance to meet the Admiral's preferred format? You can provide it but there's not a chance they'll follow it, formatting is for humanities majors so you'll need to catch and fix all that yourself.

That's not to say the LLMs are foolproof or magically always correct, but a lot of these style of criticisms apply just as much, if not more, to the current status quo. I don't need LLMs to be perfect, I just need them to be better than the current alternatives.

by mpyne

5/9/2026 at 3:50:46 PM

Before Claude Code my strategy in JetBrains AI was to start a new chat convo per task it yielded better output.

by giancarlostoro

5/9/2026 at 4:54:50 PM

I like this framing. At least as "nondeterministic" vs "deterministic" layers for the folks who flinch at "bullshit." Also "broadly capable but lossy" versus "limited capability but reliable."

Building structures of dependencies, the interface between each pair seems to collapse to the lesser of the two. So there's a ton of work right now going into TLA+, structured io, etc to force even a bit of reliability back into the LLM/program boundaries. To have any hope of chaining multiple LLM dependencies in a stack without the whole thing toppling chaotically.

by glaslong

5/9/2026 at 6:51:01 PM

> the more they will tend to gradually pull that into some homogenous abstract equilibrium

I experienced this with resume editing. The LLM removes everything that differentiates my resume from a pile of junior engineers with “average” experience. Anything that was special or unique or different was eventually replaced with generic stuff

Of course I didn’t use what it produced, but it was maddening because the LLM kept insisting this was better than what I had.

I found LLMs to be much more useful in suggesting edits to very small chunks of my resume (a sentence or three) rather than the overall vision of the document.

by TedDoesntTalk

5/9/2026 at 4:37:42 PM

I was talking about this in a thread yesterday. It’s why I don’t like blogs that are just LLM generated. I don’t care how good you think it is, I don’t care that you consider a facsimile of you good enough. If I want a rote, boring LLM response, I will prompt it myself. I do not appreciate reading blogs and other assumed to be human-generated content and having somebody attempt to trick me into reading their prompt results like some annoying middleman.

I came to your blog to read what you had to say. Why are you writing a blog if you aren’t even going to write it?

by Forgeties79

5/9/2026 at 2:51:07 PM

A human doing the same tasks as what the LLM did in the paper that the human will degrade the document further then the LLM. If the LLM is 25%, a human would degrade it probably 80% if they used the same technique as the LLM did in this paper. I'm talking about a single pass.

The fact of the matter is, humans don't edit things the way it was done in the paper and neither do coding agents like claude. Think about it: You do not ingest an entire paper and then regurgitate that paper with a single targeted edit... and neither do coding agents.

Also think carefully. A 25% degradation rate is unacceptable in the industry. The AI change that's taking over all of SWE development would not actually exist if there was 25% degradation... that's way too much.

by threethirtytwo

5/9/2026 at 3:02:38 PM

Are we comparing humans to LLMs or human written software to LLMs?

The whole point of creating software to do things used to be getting things done more accurately and consistently.

by lelanthran

5/9/2026 at 3:33:34 PM

No. The whole point of creating software is getting things done.

"More accurately and consistently" was merely downstream from what capabilities were natural for machine logic and hard algorithms.

Now, we're just spoiled for choice. We have hard algorithm software where we want to do things that benefit for accurate, consistent, highly deterministic behavior - and we have soft algorithm AI for when we want to do things that simply aren't amenable to hard logic.

Machine translation used to be a horrid mess when we were trying to do it with symbolic systems. Because symbolic systems are "consistent, highly deterministic" but not at all "accurate" on translation tasks. Being able to leverage LLMs for that is a generational leap.

by ACCount37

5/9/2026 at 5:56:54 PM

All of software is hard-coded algorithm.

If you differ between AI source code and engineer source code say so. "Getting things done" is a business need. Which things get translated to a deterministic language executable by a computer is code.

There are entire languages dedicated for lesser engineers/domain experts to formulate business requirements.

Anyhow; What's your point? That we received a framework for "soft algorithms" where the output does not need to be correct and deducible? What's even the point of putting it into software. Just forward your input to the reader and let him judge on its own.

by tommyage

5/9/2026 at 8:02:15 PM

AI is more "grown" than it is "hard-coded". It's sideways to normal software - the way DSP is sideways to normal software but somehow even worse.

It all comes down to hard logic eventually, but that "eventually" has teeth. None of the interesting behaviors of AI systems live in "engine.py".

My point is: there are tasks where the choices are to use AI, use a meatbag, or suck forever. The "use AI" option going to be flawed, and often in the same ways "use meatbag" is. But it's going to be cheaper, much more scalable, and a lot better than "suck forever". Humanlike flaws are the price you pay for accessing humanlike capabilities.

by ACCount37

5/10/2026 at 2:35:41 PM

The contempt you people have for your fellow human beings is palpable. Your point is obfuscated because your objective seems to be to degrade others.

by emp17344

5/10/2026 at 4:09:15 PM

You seem to hold human beings in way too high of a regard. That alone, in my eyes, would be well worthy of contempt.

by ACCount37

5/10/2026 at 4:36:36 PM

What? You sound like a cartoon villain. What’s wrong with you?

by emp17344

5/10/2026 at 5:50:23 PM

You sound like the type of meatbag that goes "a human would never make this type of mistake", then makes a mistake at least twice as embarrassing, and either never notices, or immediately memory holes it to preserve his overinflated self-image.

by ACCount37

5/9/2026 at 3:20:05 PM

Except that coding agents will do this at times. That's half the problem. A human will forget details and exaggerate others, but LLMs fail in spectacular ways that humans rarely would, like trying to copy a document from memory rather than one word at a time, side by side, or rewriting the whole thing just to make some simple changes. Coding agents will delete tests or return True to get them to pass - something you would never expect of even a junior professional.

And I know this because I see it all the time. I use composer-2 and sonnet 4.6 on a regular basis. It's not much better for my colleagues who use Opus or GPT or any of the other frontier models. Most of the time it's fine, but other times it does things simply unforgivable for a human. I have to watch the agent closely so that it doesn't decide to nuke my database; I don't have to do that with any of my juniors, even those with little experience and poor discipline.

by RevEng

5/9/2026 at 3:47:24 PM

> nuke

> I don’t have to do that with any of my juniors…

For some values of “nuke,” I absolutely have had to do that with juniors in the past. Perhaps you’re referring to a single rm -r or hilarious force push or something, but undertrained and unsupervised juniors regularly introduce things like SQL injection, XSS, etc. simply because they don’t know any better yet. This isn’t saying “AI is better across the board” - I just don’t think they’re comparable, also think AI shouldn’t be used to chop the bottom 5 rungs off our career ladder. But let’s not pretend juniors can be left alone with a codebase without any worries.

by xp84

5/9/2026 at 6:33:12 PM

My half-baked solution is requiring colocation of the "why" for every decision and doc the llm writes, ideally my exact words. And similarly, every so often the llm why it's doing something reveals a mismatch between your intent and its PoV.

by chermi

5/9/2026 at 5:39:59 PM

Further, could we think of intent as some ordered state, and over time the LLM introduces entropy, eventually resulting in something akin to free-association?

by mrcartmeneses

5/9/2026 at 3:50:06 PM

LLM’s are the most elaborate guessing machine man-kind has made. That’s makes it both useless and useful depending on what it is used for.

That’s it. Once you look at everything through this lense everything makes sense - especially the fact there is no underlying understanding of reasoning and creativity. I don’t care what boosters say.

by ieieue

5/9/2026 at 4:04:14 PM

I don't know what a "booster" is, but if a model can solve original math problems, then it's reasoning.

If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

by CamperBob2

5/9/2026 at 4:29:23 PM

A model solving original math problems may look like human reasoning, but internally the model is choosing the next token based on what it has learned about probability around various patterns and structures. The model knows about correlations between problems, proof techniques and answer structures, and when it "reasons" it's selecting a high probability trajectory through that learned knowledge.

A calculator is different because it is not probabilistic; it executes a fixed procedure. One of these models, when doing math, is more like a learned probabilistic system that understands enough structure around mathematics that some of its high probability trajectories seem like genuine reasoning.

The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following an explicit rule enforcement. The model may produce the same output, and may even appear to approach it the same way, but the mechanism is a probabilistic pattern selection rather than explicit rule enforcement.

by figarus314

5/9/2026 at 5:56:31 PM

You talk as if problem solving is a supervised (imitation) learning problem. No, it is a reinforcement learning problem, models learn by solving problems and getting rated. They generate their own training data. Optimal budget allocation is 1/3 cost pre-training, 1/3 for RL, and 1/3 on inference.

by visarga

5/9/2026 at 5:22:30 PM

> The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following an explicit rule enforcement.

How is this different from "probabilistic pattern selection"?

by XMPPwocky

5/9/2026 at 6:14:04 PM

Because... it's just different, that's all! OK?

by CamperBob2

5/9/2026 at 5:44:51 PM

I don’t think there’s any evidence that “human reasoning” isn’t also based on probabilistic pattern selection.

by senordevnyc

5/9/2026 at 5:17:38 PM

[flagged]

by ieieue

5/10/2026 at 8:37:35 AM

You should leave this site. Comments like this are not good for this site. You should go somewhere else.

by threethirtytwo

5/9/2026 at 6:01:39 PM

> If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

Logic is just syntactic manipulation of formulas. By the early 90s logical reasoning was pretty much solved with classical AI (the last building block being constraint logic programming).

by oldsecondhand

5/9/2026 at 6:13:26 PM

So you'll be able to show me the early-90s era program that can solve original IMO-level problems when supplied with the plaintext questions. Right?

by CamperBob2

5/9/2026 at 6:52:06 PM

if i presented math problems to the best english mathematicians in chinese, does that mean they arent able to reason? the plain text is an arbitrary constraint

by 8note

5/9/2026 at 7:12:54 PM

The actual question is, if you presented an undergraduate-level calculus problem to a human who is considered intelligent but who was never given an "understanding" of math in school, would the human be able to solve it? Why or why not?

If so, what exactly would you call the process by which the intelligent human solves the math problem that he or she does not initially understand?

Whatever you call that process is what a reasoning model does. You don't have to call it "reasoning," of course... unless you want other people to understand what you're talking about.

by CamperBob2

5/9/2026 at 4:17:02 PM

My dear sir, the entire universe is made of things that "do math without reasoning!"

It's the default, and if we're lucky we harness pieces of it to discern something we're interested in.

by Terr_

5/9/2026 at 5:16:26 PM

[flagged]

by ieieue

5/9/2026 at 3:43:20 PM

I think the problem is that we're using LLMs to do too much of the work. We should aim to design agents that use the LLM as the thinnest possible layer to translate the natural language intent into a deterministic process, minimizing round trips to the LLM as much as possible.

by wtetzner

5/9/2026 at 11:04:38 PM

This becomes clear to anyone that wants to do marginally complex work. Developing pipelines that combine pre-processing flows, semantic targeting, and minimal contextual calls to an LLM API gets you powerful automated steps. Combined with separate validation steps, LLMs go from toys to useful.

by whatisthiseven

5/9/2026 at 9:52:25 PM

A process isn't automated until neither human nor genie is in the loop.

by mohamedkoubaa

5/9/2026 at 5:27:31 PM

I typically tell my agents to only treat document writing as a last "rendering" pass. LLMs are so good at taking sparse knowledge and compiling it, that I prefer to store knowledge as composable ideas/facts.

What has worked well in practice is giving the agent a directory, and tell it to make independent markdown files for facts/findings it locates - with each file having front-matter for easy search-ability.

This de-complects most tasks from "research AND store iteratively in a final document format" to more cohesive tasks "research a set of facts and findings which may be helpful for a document", and "assemble the document".

Only a partial mitigation, but find it leads to more versatile re-use of findings, same as if a human was working.

by buffaloPizzaBoy

5/10/2026 at 4:19:12 AM

Sounds like a good system. To use the analogy from ths other comment, this would be like running an image through JPEG compression twice.

The issue happens then if you're updating the individual research files on a regular basis. (Or making a long series of commits on a starting code base.) Every edit has a chance of doing a drive-by cleanup on nearby lines. Over a long enough timeline, it'll ablate your logic into something featureless, like if you compress an image too many times.

by xstas1

5/9/2026 at 12:42:39 PM

I really liked the evaluation method here - testing fidelity by round-tripping through chains of invertible steps. It was striking how even frontier models accumulated errors on seemingly computer-friendly tasks.

It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.

by jonmoore

5/10/2026 at 11:46:46 AM

> Our main experiment is a round-trip relay with N = 10 consecutive round-trips per environment, simulating 20 delegated interactions. In each interaction, the model receives all work environment documents as text in its context window in a single turn

The LLM isn't being given an actual file system they can work with - they're expected to receive the document as text in the prompt, perform a task, and then re-output text into the conversation?

Maybe I'm misunderstanding the methodology, but this feels a lot like the human game of Telephone - or perhaps, asking one to do a similar editing task using only Microsoft Outlook with copy/paste disabled.

I'd imagine that one gets radically different results if one uses the appropriate desktop tools, just like humans do much better outside games of Telephone.

by handoflixue

5/9/2026 at 1:42:52 PM

LLMs will make mistakes on every turn. The mistakes will have little to no apparent connection to "difficulty" or what may or may not be prevalent in the training data. They will be mistakes at all levels of operation, from planning to code writing to reporting. Whether those mistakes matter and whether you catch them is mostly up to you.

I have yet to find a model that does not make mistakes each turn. I suspect that this kind of error is fundamentally incorrigible.

The most interesting thing about LLMs is that despite the above (and its non-determinism) they're still useful.

by adampunk

5/9/2026 at 2:37:53 PM

> I have yet to find a model that does not make mistakes each turn

What kind of mistakes are you talking about here?

by simonw

5/9/2026 at 1:49:20 PM

As a human I make typos all the time

by pyrolistical

5/9/2026 at 2:18:15 PM

A human can sit down and say “I’m going to make sure this is correct on the first pass and make sure I make an exact copy.”

They have cognitive awareness of which tasks are highly critical and need more checking and re-checking without being prompted to think that way.

For a human, time doesn’t stop when the first pass of the prompt and response is over. An LLM effectively wipes its memory of what it just did unless something is keeping track of a highly resource constrained context.

An LLM is like an author of a book that immediately closes its eyes and wipes its memory after writing a chapter. Sure, it can pull some of that back in the next query via context, and it can regain context very quickly, but it effectively has no memory of the exact thing it just did.

When a human is doing these tasks there is a lot of room for mistakes but there’s also a wildly higher capacity for flowing through time.

by dangus

5/9/2026 at 2:24:54 PM

[flagged]

by adampunk

5/9/2026 at 2:33:40 PM

Humans understand what mistakes are and can reason about what constitutes a mistake and what doesn’t. LLMs can’t do that.

It’s for the same reason that they will invent bullshit instead of saying “I don’t know”, when they don’t know. They don’t have a concept of accuracy of facts.

by simonh

5/9/2026 at 2:28:49 PM

And that’s why I’m paid six figures and my LLM is paid $20/month.

by dangus

5/9/2026 at 6:06:09 PM

The LLM makes typos for me all the time using AI autocomplete. It's caused a lot of frustration while coding, because it makes mistakes that I would not. When it does help, it's great, but the errors waste as much time as the LLM saves me. Even using agentic coding, AI is mostly break-even for me.

by leptons

5/9/2026 at 1:58:54 PM

I do too! I also make higher level design errors and get too enthusiastic about projects before code is written.

We are, in a sense, fallible machines who have designed a planet-wide computational fabric around that fact.

by adampunk

5/9/2026 at 2:24:08 PM

[flagged]

by peyton

5/9/2026 at 2:36:39 PM

> We find that models are not failing due to “death by a thousand cuts” (i.e., many small errors). Instead, they main- tain near-perfect reconstruction in some rounds, and experience critical failures in a few rounds, typically losing 10-30+ points in a single round trip

> We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.

I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.

by meander_water

5/9/2026 at 3:44:32 PM

I've spent the last few months reading a lot of AI-generated code. It's extremely difficult.

It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a humans code you can (mostly) feel their theory working in the background. LLMs have no such theory, the code is just facts strewn about. Very weird experience to try and understand it.

by danielvaughn

5/9/2026 at 5:59:18 PM

My company is moving to a workflow where we only write Jira tickets, the LLM writes all the code and submits a PR. Then we are supposed to review the code the LLM wrote.

I'm looking for a new job.

by leptons

5/9/2026 at 7:02:00 PM

that doesnt seem particularly horrible, as long as you as the engineer can still go change things in the code package and surrounding infrastructure to improve the output, and make sure that the agent is actually making the right stuff the first time you see the outputs

eg. setting up better feedback loops, improving CI/CD, breaking changes up at the right scale, etc.

you i assume also can then put in more work up front, doing simulations of solutions, lean proofs, and so on?

more engineering, less plumbing

by 8note

5/9/2026 at 11:16:37 PM

The change is turning me from someone who writes pretty good reliable code, to someone who has to read and review pretty bad code. If you think this is an improvement, you're nuts.

It is inserting a pretty unreliable middle-man know for errors and hallucinations, that often just goes down and stops working for reasons we can't control into a workflow that has worked well for a decade, and we're paying extra to really break-even on the time spent creating new code.

Just because "everyone else is doing it". Not because it's proving to be a boon in productivity.

by leptons

5/10/2026 at 4:20:43 AM

Turning you into a "reverse centaur" to borrow a term from Cory Doctorow

by xstas1

5/10/2026 at 7:04:47 AM

my (gender neutral) dude.

WAKE UP.

Literally anyone can write a Jira ticket. US engineers are expensive. What do you think will happen when the powers that enacted this policy decide that the ticket to merged into prod rate is acceptable to them?

by nunez

5/9/2026 at 5:05:10 PM

Thank you I've had trouble articulating this sense, but it's strong. An uncanny valley.

by glaslong

5/10/2026 at 1:07:59 PM

My problem with this kind of work is—-obviously they do. Did anyone seriously think otherwise? I’m shocked why these are even questions deserving scientific scrutiny. Have people truly lost their critical thinking that badly already?

by fatso784

5/10/2026 at 11:34:10 PM

In history, why did scientists research gravity for so long? Were they too stupid to realize that they were obviously being pulled towards the ground? No. They hoped to learn about the details. Eventually they learned details that were not apparent from everyday experience, such as the formula for how gravity scales with mass.

It’s the same here. For example, this study concluded that most changes are safe and some are very bad, as opposed to most changes being slightly bad. That is not obvious, especially to infrequent LLM users.

Also, even “obvious” conclusions are within the scope of science. I’ve spent too long writing this already to look up an example, but I bet there have been countries in the past whose leaders chose “obviously-good” monetary policies that economic research could have shown was counterproductive. The world is complicated, and without systems of communication such as academia, it’s hard to be sure if what you see is what everyone else sees.

by roryokane

5/9/2026 at 12:45:32 PM

I played around with a local LLM to try and build a wiki like DAG. It made a lot of stupid errors from vague generic things like interpreting based on file names to not following redirects and placing the redirect response in them.

I've also had them convert to markdown something like an excel formatted document. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to try in slip things in that seemed related but wasn't part of the break down.

The only way I've found to mitigate some of it is to make every file a small-purpose built doc. This way you can definitely use git to revert changes but also limit the damage every time they touch them to the small context.

Anyone who thinks they're a genius creating docs or updating them isnt actually reading the output.

by cyanydeez

5/9/2026 at 1:56:46 PM

> I've also had them convert to markdown something like an excel formatted document.

This look like a task where the LLM would be best used in writing a deterministic script or program that then does the conversion.

Trusting a LLM to make the change without tools is like telling the smartest person you know to just recite the converted document out loud from memory. At some point they'll get distracted, wrong, or unwittingly inject their own biases and ideas into it whenever the source data is counter-intuitive to them.

by sebastiennight

5/9/2026 at 2:23:55 PM

I see people cut and paste from Excep into a chat, as an image, and ask it to sum up numbers.

by trollbridge

5/9/2026 at 3:32:34 PM

I’ve seen people drink their own recycled piss and inject coffee into their ass - what’s your point?

by somewhatgoated

5/9/2026 at 5:41:35 PM

In the first half, I thought you were an astronaut, but the second half has me double-guessing myself.

by sebastiennight

5/9/2026 at 9:51:37 PM

I used to be a connoisseur of weird Facebook groups - I would advise everyone to never look into aged urine, coffee enemas or targeted individuals - makes you lose your faith in humanity

by somewhatgoated

5/9/2026 at 2:25:33 PM

it was, but the formatting was garbage so it ran again to fix thw format.

by cyanydeez

5/10/2026 at 2:55:29 AM

> Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

That's why harnesses and prompting rituals using dozens of markdown down files do not work as advertised and is pretty much snake oil otherwise known as "agentic engineering".

Also, the agentic engineering is pretty much so called prompt engineering except that prompt is now spread across dozens of markdown files directories.

by wg0

5/9/2026 at 5:08:43 PM

What I find fascinating about LLMs is that a lot of their failures seem strikingly similar to the failures that humans struggle with. I’m not sure what this “means” but I think it’s interesting that we can theoretically fix these failures for LLMs but for humans it is much harder. You pretty much need to educate / indoctrinate people for their entire lives and even then it’s messy and unpredictable and prone to failure—just like LLMs.

by rmwaite

5/9/2026 at 2:38:44 PM

This experiment needs to be put in perspective. Let me explain. IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

The way this experiment is conducted is not inline with how current agentic AI is used OR how even humans edit documents.

Here's how agentic AI currently typically do edits:

1. They read the whole document. 2. They come up with a patch. A diff of the section they want to edit. 3. They change THAT section only.

This is NOT what that experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using claude code because of that. The reality is... everyone is using claude code.

AI is alien to the human brain, but in many ways it is remarkably. This is one aspect of similarity in that we cannot edit a whole document holistically to produce one edit. It has to be targeted surgical edits rather then a regurgitation of the entire document with said edit.

by threethirtytwo

5/9/2026 at 6:14:51 PM

>IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

Except that isn't how humans edit documents, and it isn't how LLMs work either.

When a human edits a document, they don't typically "reproduce said document with edits", which I assume you mean read the document and reproduce it from memory. They have the document, either physically printed out, or in a word processor. To make edits they either cross-out and write in the edit, or in a word processor just delete the text and replace it with something better. There's no need to keep the entire document in a human's memory for them to reproduce it from memory.

The same goes for the LLM, it has access to the original document at all times. It can remove sections and replace them.

But the LLM hallucinates.

And if you give a document to a human high on LSD to edit, you might get some weird edits back.

by leptons

5/9/2026 at 9:00:31 PM

>Except that isn't how humans edit documents,

Bro. That's my point.

>and it isn't how LLMs work either.

This is also my point. To be more technical about it, the harness around the LLM pushes it to do surgical edits rather then regurgitation, so my point is this experiment is garbage and testing an impractical and rarely used use case.

>When a human edits a document, they don't typically "reproduce said document with edits", which I assume you mean read the document and reproduce it from memory.

No shit sherlock. The point of that sentence was to illustrate the absurdity of doing that which in turn illustrates the absurdity of this scientific paper. You're kind of lost.

by threethirtytwo

5/9/2026 at 8:31:05 PM

>IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

I like the idea that imagining somebody doing something in a way that nobody does it because it makes no sense for a person to do it like that is helpful here. It is like

IF you made a human eat an ENTIRE IHOP™ Chicken Fajita Omelette in one bite they would CHOKE and the OMELETTE would go UNDIGESTED. It would get everywhere and the OMELETTE would be RUINED.

by jrflowers

5/9/2026 at 8:56:54 PM

That's the point bro. I am saying this Experiment makes no sense.

Humans don't do that. And Claude doesn't edit documents like that. Because it makes no sense. The point is saying that the Experiment itself is not helpful here.

by threethirtytwo

5/9/2026 at 9:06:51 PM

It is, in fact, pretty common for people to throw a document at a language model along with a “make it more gooder” prompt.

by jrflowers

5/10/2026 at 12:48:12 AM

That was true maybe 7 months ago. This is no longer the case. Harnesses use all kinds of tooling to edit things now.

by threethirtytwo

5/10/2026 at 1:44:28 AM

People paste entire documents into gemini and chat gpt’s text boxes on the web and assume it will all turn out great

edit: apparently got beaten to this

by jrflowers

5/10/2026 at 8:33:56 AM

I don’t understand you. We have an AI model. The AI model is obviously capable.

But you want to use pretend that it’s not useful because non technical people haven’t figured out how to properly use it yet?

Do you think that’s a valid argument? This article is making a claim of 25 percent degredation. Do you think that claim is true because a lot of people don’t use it right?

Humans have 99 percent degredation when editing one punctuation point of an entire book when regurgitating that entire book just to change one punctuation point. Does this statement sound reasonable to you? Because that is the statement you and your genius interloper into this thread are standing behind. Just replace human with LLM and it’s the same kind of genius logic.

by threethirtytwo

5/10/2026 at 1:31:25 AM

I think you’re living in a bubble if you think the average user of AI even knows what a harness is

The vast majority of people are literally going to chatGPT, pasting in their document and asking for edits.

by rune-dev

5/10/2026 at 3:09:56 AM

This will change too man. Maybe I am in a bubble but with how fast things are changing, it won’t be too long before the bubble becomes reality.

Either way we should be doing experiments on the actual capabilities of AI not about the stupidest possible way to use AI because it helps validate your own negative bias against AI.

Additionally as software engineers using agentic AI… which HN basically is… this experiment is not at all relevant in the context of where it is posted. We ALL use agentic ai and we all have the agent use surgical tools for editing. Don’t you find it strange that despite the fact we all do this, HN is full of rabid engineers gobbling this paper up as validation despite complete lack of relevance?

by threethirtytwo

5/10/2026 at 5:17:04 AM

> This will change too man. Maybe I am in a bubble but with how fast things are changing, it won’t be too long before the bubble becomes reality.

You can’t get mad at an experiment for not happening in the future.

> Either way we should be doing experiments on the actual capabilities of AI

They simulated common end user behavior

>because it helps validate your own negative bias against AI.

We’ve gone from “this study is flawed because language models don’t do that” to “this study is flawed because while language models do do that, I don’t think that they will in the future” to “data that could support a bias other than my own is bad”

by jrflowers

5/10/2026 at 8:29:19 AM

> You can’t get mad at an experiment for not happening in the future.

I’m more getting mad at this sentence not making any sense. I’m disappointed at this experiment for not testing the actual capabilities of an LLM. Comprende?

> They simulated common end user behavior

Not the way you use it. And not the way it will be used.

You love it because you want it to stay this way so you can forever believe AI will never be better than you.

Bro the reality is unfolding as you speak. It’s like humanity just discovered guns but hasn’t discovered the bullets and your saying guns are useless because most of humanity hasn’t figured out bullets yet.

> We’ve gone from “this study is flawed because language models don’t do that” to “this study is flawed because while language models do do that, I don’t think that they will in the future” to “data that could support a bias other than my own is bad”

This is a flat out lie. Models DO do that. The only fucking argument you have is that non technical and average laymen people edit documents the wrong way while all people who use agentic AI as adepts use it the correct way. Like are you fucking kidding me?

The only change I acknowledge is your grandma copies and pastes essays into ChatGPT while YOU don’t. You go pretend you live in that reality where the bullets will never appear.

by threethirtytwo

5/10/2026 at 10:00:07 AM

>You love it because you want it to stay this way so you can forever believe AI will never be better than you.

>Bro the reality is unfolding as you speak

>You go pretend you live in that reality where the bullets will never appear.

It’s too late bro, roko’s basilisk was real and it’s already punishing you

by jrflowers

5/10/2026 at 6:34:34 PM

Stick with the argument.

When I said the experiment is inaccurate to the current abilities of AI it’s fucking right. Admit it and stop going off tangents.

There’s no argument against this. You’re dodging and weaving trying to dodge reality. I don’t know who roko is and I don’t give a shit.

by threethirtytwo

5/10/2026 at 11:36:22 PM

There isn’t an argument. We agree that

> Models DO do that.

and I haven’t challenged that this doesn’t sit comfortably with your opinions about the future. I believe that you feel that way, nobody is arguing that you don’t

by jrflowers

5/10/2026 at 11:23:19 AM

First off, It’s good to study all kinds of things isn’t it? Even if it’s not strictly practical.

Second, and more importantly these AI tools are EVERYWHERE right now. The effects of people using them for work can be seen throughout many industries and workplaces.

So I think studying how these models perform in the vast majority of use cases is not only a good idea, but it’s actually really important.

Even if you’re strictly pro-AI and believe it is the future, a study like this can help you explain to laymen why they need the harnesses you’re so in support of.

by rune-dev

5/10/2026 at 9:40:47 PM

> First off, It’s good to study all kinds of things isn’t it? Even if it’s not strictly practical.

Course it is. But the conclusion everyone is coming to is that LLMs are garbage and can’t be used because of 25 percent degradation which is not in line with reality.

> Second, and more importantly these AI tools are EVERYWHERE right now. The effects of people using them for work can be seen throughout many industries and workplaces.

At 25 percent degradation these tools would not be everywhere. They are everywhere because it’s not actually used that way.

> So I think studying how these models perform in the vast majority of use cases is not only a good idea, but it’s actually really important.

I have less of a problem with this study and more about the interpretation of this study.

> Even if you’re strictly pro-AI and believe it is the future, a study like this can help you explain to laymen why they need the harnesses you’re so in support of.

I’m not pro-AI. I’m anti AI. I fucking hate fucking AI.

What I’m angry at is this delusional denial of reality. This experiment is very obviously not accurate yet people are using this study as a headliner to promote an anti AI agenda.

I don’t like AI but that’s different for lying to myself or trying to say AI sucks at something when it is in fact superior then us in this respect.

by threethirtytwo

5/9/2026 at 6:45:58 PM

Benjamin Franklin famously taught himself to write well by doing what you describe: Read a piece of a book, then rewrite it, then compare.

At first his copies were badly degraded. Eventually, he was considered one of the best writers of his time.

I feel like there's probably some way "the copy is better" could be quantified (at least to the point where it fools most of the people most of the time). If so, then expect LLMs to learn the same trick within a generation or two.

by hedora

5/9/2026 at 4:01:44 PM

LLM editing should be done to produce deterministic output.

That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.

by andrewljohnson

5/9/2026 at 6:17:34 PM

I always thought it was a little weird that LLMs aren't sophisticated enough to surgically edit files as needed.

For example, if there is a code block that needs to be wrapped within another function call, it'll rewrite the entire function call and you'll just have to pray that the re-written code block wasn't subtly changed.

I _think_ so far it hasn't introduced any changes....

by julianlam

5/9/2026 at 6:25:53 PM

You can just look at the diff when you do a pull request, no prayer needed, and if you want it to be “surgical” in that way, your prompt (and agents.md) can be specific.

You can also unit test the function to better assure behavior didn’t change.

by andrewljohnson

5/9/2026 at 9:44:21 PM

Indeed, that's what I do. I inspect the diff, though if it's an indentation change the entire block will be marked changed.

Still not an excuse to not read every line of course...

Unit tests give me the confidence that at least those tested logic paths are unaffected.

Sometimes with older codebases one cannot assume the paths have adequate test coverage.

by julianlam

5/11/2026 at 12:09:41 AM

Many diff tools, such as delta (https://github.com/dandavison/delta) and the ones built into VS Code and IDEs by JetBrains, can configured to highlight changes within each line (by word) and ignore changes to whitespace. Those features save me a lot of time when I review diffs that include indentation changes or variable renames.

by roryokane

5/9/2026 at 6:04:14 PM

This gets skipped because continual approvals break up user flow so we let LLMs make a few hundred line diffs then a user does a bulk review, and can just revert all/partially. It's naieve to assume user will review every LOC in every instance.

by alansaber

5/9/2026 at 6:17:15 PM

I’m fine with bulk review, it just has to get reviewed before a merge. You don’t need to review the LLM output as you work except as it aids you to work.

by andrewljohnson

5/10/2026 at 1:09:13 AM

Doesnt this apply to humans as well? Thats why children play the game "Telephone" and watch as a message gets corrupted. The solution is to provide single source of truth.

by charlie90

5/9/2026 at 7:08:09 PM

When AI generates code, we have the ability to easily verify it and test it.

The same is not so easy with free form text. I have been thinking about this mainly around when agents write plans or edit plans, but I think figuring out how to do this in general would be a huge breakthrough.

Logical English was one idea I came across and Runcible https://runcible.com/ was another idea I recently stumbled on.

by tmaly

5/10/2026 at 3:58:46 AM

Remind yourselves that most research papers are written by career students with no real world practical experience. That is all.

by Art9681

5/10/2026 at 6:57:07 AM

Spending some time in and around applied research labs and seeing how poorly the sausage looks before it gets made into a paper is quite distressing.

I’m sure there are labs out there doing excellent work (especially those focused on theory), but most of the applied research I’ve seen up close and personal is very poor indeed.

by LPisGood

5/9/2026 at 3:13:35 PM

I thought this was going to be about a problem we saw recently. Someone used an LLM to update the comment block at the start of each source file, and the LLM programmed its own tool that ended up changing ALL of the line endings when it output again with the corrected comment block. Instead of an LLM we could have used find and replace, but people are thinking LLM is the only tool.

by twobitshifter

5/10/2026 at 1:48:17 PM

LLMs corrupt your documents. Proceeds to use AI to WRITE PAPER and DESIGN experiments.

by belabartok39

5/10/2026 at 8:32:10 AM

You can get around the problem by doing a git diff of the unstaged file and a previous commit.

This works well for code regressions but also works for document writing. I've automated it at this point.

A case where using the CLI agent is much better than using the web chat.

by tim-projects

5/10/2026 at 2:09:01 AM

I'm making tools for fighting this kind of degradation: https://github.com/JigSpec/JigSpec

by enrique_mendez

5/9/2026 at 1:26:12 PM

It's an interesting paper, but I'd like to see a lot more about the types of errors that the LLM makes. Are they happening in the forward pass or the inverse pass? My guess is the inverse pass.

by woeirua

5/9/2026 at 7:14:17 PM

This sounds like wishful thinking to me.

The tasks are designed to be reversible. Whether it stochastic parrots in the forward direction or reverse direction is irrelevant. Especially considering these are inference engines. Every pass is a forward pass from the perspective of the LLM / agent. There is no feedback loop, and part of the reason why it's so easy for these things to mangle tasks. They are plausible sounding sentence/sequence generators.

by daveguy

5/10/2026 at 12:51:17 AM

With this paper by Microsoft and the infamous paper by Apple last year, it seems the tech giants that don't have their own models are getting a bit insecure.

by pickleRick243

5/9/2026 at 11:37:49 PM

Simple quick blind test for fun: https://w.merkoba.com/pickabot/

by madprops

5/10/2026 at 7:20:09 AM

I am surprised that not more people talk about this, I once had an ssh key deleted, so unexpected it took me a while to debug.

We live and learn.

Still a huge fan though.

by peter_retief

5/10/2026 at 11:24:28 AM

In my experience there's no longer any good reason to post research papers investigating limitations of LLMs on HN any more because they are always met with one, or all, of the following arguments that have now taken the status of thought-terminating clichés:

1. It's an older model.

2. You're prompting it wrong.

3. That's not what LLMs are for.

4. We knew that already.

It's as if there LLMs have no limitations, which of course goes completely against number 1 in the list above, because if LLMs have no limitations then how are newer models better and why are AI companies constantly releasing new versions?

But the debate has taken on an insidious identitarian character: it's no longer about understanding a technology, its strengths, its limitations, what makes it tick. It's a fractious internet fight between crowds of users who have attached this or that opinion to their very internet persona and will not budge from their entrenched positions.

That is basically the death of curious debate. Obviously there's no point in discussing any research under those conditions: good or bad, flawed or not, we're just not going to get any signal out of the noise on HN anymore.

by YeGoblynQueenne

5/9/2026 at 4:50:41 PM

We don't need a study to tell us that LLMs always make mistakes. We already knew that. Anyone with sense is not using LLMs because of that.

by bigstrat2003

5/9/2026 at 6:01:28 PM

May your contexts always be short

by rao-v

5/9/2026 at 8:21:38 PM

Before I read some of the study, I thought that was relevant too, but each "step [was] conducted as an independent, single-turn session."

by tieTYT

5/12/2026 at 12:18:39 AM

Good point!

by rao-v

5/9/2026 at 3:00:58 PM

this is literally just “leave a child at the work computer with a real doc open playing office”. otoh it is good to design benchmarks tonground these things.

on the flip side if you’re literally just using a bare bones harness on top of a stochastic parrot, of course stochastic errors accumulate.

theres a lot of ways for improving text faithfulness through harness tool designs, and my incremental experiments seem promising.

but unless work is gated on shit like “the script used must type checked ghc haskell or lean4”, unsupervised stuff is gonna decay

by carterschonwald

5/9/2026 at 4:54:59 PM

It’s not a stochastic parrot.

by rhubarbtree

5/9/2026 at 9:05:58 PM

It’s a stochastic goblin.

by kgwgk

5/9/2026 at 11:16:38 PM

good one

by carterschonwald

5/10/2026 at 7:41:37 PM

[flagged]

by theuniverseson

5/9/2026 at 3:41:59 PM

[dead]

by simonreiff

5/10/2026 at 12:02:57 AM

[flagged]

by GhostDriftInc

5/10/2026 at 12:54:59 PM

[flagged]

by chris_explicare

5/9/2026 at 2:54:55 PM

[flagged]

by arian_

5/9/2026 at 10:04:17 PM

[flagged]

by clearstack

5/10/2026 at 11:41:20 AM

[flagged]

by toshikatsu-oga

5/10/2026 at 2:08:49 AM

[flagged]

by xiaosong001

5/10/2026 at 8:13:59 AM

[flagged]

by 30030

5/9/2026 at 5:39:10 PM

[flagged]

by Bmello11

5/9/2026 at 11:41:22 PM

[dead]

by Amber-chen

5/9/2026 at 4:04:46 PM

[flagged]

by OfekSh

5/10/2026 at 9:58:18 AM

[dead]

by Ozzie-D

5/10/2026 at 9:04:32 AM

[flagged]

by 30030

5/9/2026 at 3:58:27 PM

[flagged]

by BrightGirl

5/9/2026 at 11:04:05 PM

[flagged]

by deferredgrant

5/9/2026 at 7:12:49 PM

Yeah so I run my agents as a different user that do not have write perms to my /home

Then I can diff what they wrote with my copy

Users are the OG container. On Linux it's possible to constrain a user to a network namespace, cgroups.

BPF can be used like docker compose to ensure a service running under a user is running

TL;DR a lot of the userspace cruft we import to run software has been rolled into the kernel over the last 10-15 years.

Ignore the terminology "user". Under the hood all the same constraint and boundary setting you want exists without downloading the entire internet

by y3ahd0g