alt.hn

3/7/2026 at 5:08:30 PM

Verification debt: the hidden cost of AI-generated code

https://fazy.medium.com/agentic-coding-ais-adolescence-b0d13452f981

by xfz

3/7/2026 at 6:31:42 PM

Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:

- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it

- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

- Get better AI-based review: Greptile, Bugbot, and half a dozen others

- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.

None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.

But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.

by fishtoaster

3/7/2026 at 6:56:12 PM

Translating from a natural language spec to code involves a truly massive amount of decision making.

For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.

Where we are today (that is, agents require guardrails to keep from spinning out), there is no way to let agents work on code autonomously that won't end up with all of those observable differences constantly shifting, resulting in unusable software.

Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.

The only solution to this problem is that LLMs get better. Personally I think at the point they can pull this off, they can do any white collar job, and there's no point in planning for that future because it results in either Mad Max or Star Trek.

by sarchertech

3/7/2026 at 11:36:43 PM

> Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.

I don't think "complex" is the right word here. A test suite would generally be more verbose than the implementation, but a lot of the time it can simply be a long list of input->output pairs that are individually very comprehensible and easily reviewable by a human. The hard part is usually discovering what isn't covered by the test cases, rather than validating the correctness of the test cases you do have.
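That shape is easy to picture as a table-driven test: each case is a plain input -> output pair a reviewer can eyeball one at a time (a sketch using a made-up slugify function, invented for illustration, not from the thread):

```python
import re

# Hypothetical function under test: lowercase, then collapse runs of
# non-alphanumerics into single hyphens, trimming hyphens at the ends.
def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# The "test suite" is just a list of pairs; each one is individually
# obvious to a human, even if the list as a whole is long.
CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  everywhere  ", "spaces-everywhere"),
    ("Already-A-Slug", "already-a-slug"),
    ("", ""),
]

for given, expected in CASES:
    assert slugify(given) == expected, (given, slugify(given), expected)
```

The hard part, as the comment says, is knowing which pairs are missing from CASES, not checking the ones that are there.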

by wtallis

3/8/2026 at 2:07:22 AM

At some point verbosity becomes complexity. If you're talking about all observable behavior, the input and output pairs are likely to be quite verbose/complex.

Imagine testing a game where the inputs are the possible states of the game plus the possible control inputs, and the outputs are the states that could result.

Of course very few human written programs require this level of testing, but if you are trying to prevent a swarm of agents from changing observable behavior without human review, that's what you'd need.

Even with simpler input/output pairs: an AI tells you it added a feature and had to change 2,000 input/output pairs to do so. How do you verify that those changes were necessary, and how do you verify that you actually have enough cases to prevent the AI from doing something dumb?

Oops you didn’t have a test that said that items shouldn’t turn completely transparent when you drag them.

by sarchertech

3/8/2026 at 2:07:53 AM

Code is like f(x)=ax+b. Your test would be a list of (x,y) tuples. You don't verify the correctness of your points because they come from some source that you hold as true. What you want is the generic solution (the theory) proposed by the formula. And your test would be just a small set of points, mostly to ensure that no one has changed the a and b parameters. But if you only have a finite number of points, the AI is more likely to give you a complicated spline formula than the simple formula above, unless the tokens in the prompts push it to the right domain space (usually meaning that the problem is solved already).

Real code has more dimensionality than the above example. Experts have the right keywords, but even then it's a roll of the dice. And coming up with enough sample test cases is more arduous than writing the implementation.

Unless there's no real solution (the dimensionality is too high) but we have a lot of test data with a lower dimensionality than the problem. This used to be called machine learning, and we have metrics like accuracy for it.
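The underdetermination being described is easy to demonstrate: any finite set of points sampled from f(x)=ax+b is also satisfied by more complicated formulas, so a passing test suite can't pin down the simple theory (a sketch with hand-picked numbers of my own, not from the comment):

```python
# Points sampled from the intended "theory" f(x) = 2x + 3.
points = [(0, 3), (1, 5), (2, 7)]

def line(x):
    # The simple formula the spec intends.
    return 2 * x + 3

def impostor(x):
    # A cubic that agrees with the line on every sampled point,
    # because the extra term vanishes at x = 0, 1, 2.
    return 2 * x + 3 + x * (x - 1) * (x - 2)

# Both implementations pass the entire finite test suite...
assert all(line(x) == y for x, y in points)
assert all(impostor(x) == y for x, y in points)

# ...yet they diverge immediately off the sample.
assert line(3) == 9
assert impostor(3) == 15
```

Adding more points shrinks the gap but never closes it: for any finite sample there is always another polynomial that threads the same points.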

by skydhash

3/7/2026 at 11:38:50 PM

If some of those input-output pairs are the result of a different interpretation of the spec to other input-output pairs, it's possible that no program satisfies all the tests (or, worse, that a program that satisfies all the tests isn't correct).

by wizzwizz4

3/7/2026 at 7:15:43 PM

>For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.

If they're not defined in the spec then these differences shouldn't matter, they're just implementation details. And if they do matter, then they should be included in the spec; a natural language spec that doesn't specify some things that should be specified is not a good spec.

by logicchains

3/7/2026 at 10:27:41 PM

> we just need to make the spec perfect

So, never.

Greg Kroah-Hartman was once asked by his boss, "when will Linux be done?" and he said, "when people stop making new hardware." Even today, when we assume the hardware won't lie, much of the work in maintaining Linux is around hardware bugs.

So even at the lowest levels of software development, you can’t know the bugs you’re going to have until you partially solve the problem and find out that this combination of hardware and drivers produces an error, and you only find that out because someone with that combination tried it. There is no way to prevent that by “make better spec”.

But that's always been true. Basically it's the three-body problem. On the spectrum of simple-complicated-complex, you can calculate the future state of a system if it's simple, or "only complicated" (sometimes), but you literally cannot know the future state of a complex system without simulating it, running each step and finding out.

And it gets worse. Software ranges from simple to complicated to complex. But it exists within a complex hardware environment, and also within a complex business environment where people change and interest rates change and motives change from month to month.

There is no “correct spec”.

by halfcat

3/7/2026 at 7:36:18 PM

There are a limitless number of implementation details you don't think you care about until they are constantly changing.

I doubt there exists a single piece of nontrivial software today where you could randomly alter 5% of the implementation details while keeping to the spec, without resulting in a flood of support tickets.

by sarchertech

3/7/2026 at 7:56:46 PM

Agreed, but with one exception: are tests supposed to cover all observable behavior? Usually people are happy with just eliminating large/easy classes of bad (unintended) behavior; otherwise they go for formal verification, which is an entirely different ballgame.

by Herring

3/7/2026 at 8:39:59 PM

No they aren’t because they can’t (at least not without becoming so complicated that there’s no longer a point).

But humans are much better at reasoning about whether a change will impact observable behavior than current LLMs are, as evidenced by the fact that LLMs require a test suite or something similar to build a working app longer than a few thousand lines.

by sarchertech

3/7/2026 at 8:40:11 PM

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen

Every strategy which worked with an off-shore team in India works well for AI.

Sometime in mid 2017, I found myself running out of hours in the day stopping code from being merged.

On one hand, I needed to stamp the PRs because I was an ASF PMC member and not a lot of the folks opening JIRAs were. And this wasn't a tech-debt-friendly culture, because someone from LinkedIn or Netflix or EMR could say "Your PR is shit, why did you merge it?" and "Well, we had a release due in 6 days" is not an answer.

Claude has been a drop-in replacement for the same problem, where I have to exercise the exact same muscles, though a lot easier because I can tell the AI that "This is completely wrong, throw it away and start over" without involving Claude's manager in the conversation.

The manager conversations were warranted and I learned to be nicer two years into that experience [1], but it's a soft skill which I no longer use with AI.

Every single method which worked with a remote team in a different timezone works with AI for me, perhaps better, because they're all clones of the best available: specs, pre-commit verifiers, mandatory reviews by someone uncommitted to the deadline, ease of reproducing bugs outside production, and less clever code overall.

[1] - https://notmysock.org/blog/2018/Nov/17/

by gopalv

3/7/2026 at 10:27:39 PM

> Every strategy which worked with an off-shore team in India works well for AI.

Why, then, hasn't SWE been completely outsourced over the last 20 years? Corporations were certainly trying hard.

by mentalgear

3/8/2026 at 3:41:23 PM

Cost. Claude Code is two orders of magnitude cheaper than an offshore dev.

by arunabha

3/8/2026 at 5:41:25 PM

We are talking 20-30 years back, when offshore was (and still is) cheaper.

by mentalgear

3/7/2026 at 7:07:18 PM

> Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO

Replace AI written with “cheap dev written” and think about why that isn’t already true.

The bottleneck is a competent dev understanding a project. Always has been.

Another fundamental flaw is you can’t trust LLMs. It’s fundamentally impossible compared to the way you trust a human. Humans make mistakes. LLMs do not. Anything “wrong” they do is them working exactly as designed.

by ahsisjb

3/7/2026 at 11:40:19 PM

>Humans make mistakes. LLMs do not. Anything “wrong” they do is them working exactly as designed.

This requires a redefinition of the term mistake, no?

by NewsaHackO

3/7/2026 at 6:51:05 PM

> Include the spec for the change in your PR

We would have to get very good at these. It's completely antithetical to the agile idea where we convey tasks via pantomime and post-its rather than formal requirements. I won't even get started on the lack of inline documentation and its ongoing disappearance.

> Lean harder on your deterministic verification: unit tests, full stack tests,

Unit tests are so very limited. Effective but not the panacea that the industry thought it was going to be. The conversation about simulation and emulation needs to happen, and it has barely started.

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen.

Most people who write software are really bad at reading others' code and at systems-level thinking. This starts at hiring: the leetcode interview has stocked our industry with people who have never been vetted or measured on these skills.

> But anyone who's able to ship AI code without human review

Imagine we made everyone go back to the office, and then randomly put LSD in the coffee maker once a week. The hallucination problem is always going to be non-zero. If you are bundling the context in, you might not be able to limit it (short of using two models adversarially). That doesn't even deal with the "confidently wrong" issue... what's an LLM going to do with something like this: https://news.ycombinator.com/item?id=47252971 (random bit flips)?

We haven't even talked about the human factors (bad product ideas, poor UI, etc.) that engineers push back against and an LLM likely won't.

That doesn't mean you're completely wrong: those who embrace AI as a power tool, and use it to build their app, and tooling that increases velocity (on useful features) are going to be the winners.

by zer00eyz

3/7/2026 at 9:51:29 PM

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen.

Why do you assume that's doable? I'm not saying it's not, but it seems strange to just take for granted that it is.

by gspr

3/7/2026 at 11:08:14 PM

Why do you assume I assume it's doable? :P

For real, I'm not certain we will ever be able to merge AI code without human review. But:

1. Every time I've confidently thought "AI will never be able to do X" in the last year, I've later been proven wrong, so I'm a bit wary of assuming that again without strong reasons.

2. I see blog posts by some of the most AI-forward people that seem to imply some people are already managing large codebases without human review of raw code. Maybe they're full of crap - there are certainly plenty of over-credulous bs artists in the AI space - but maybe they're not.

3. The returns on figuring this out are so incredibly high that, if it's possible, people will figure it out.

All that to say: it's far from certain, but my bias is that it is possible.

by fishtoaster

3/8/2026 at 2:23:39 PM

> Why do you assume I assume it's doable? :P

Because you say we need to figure out techniques to do it. If it's not possible, then there are no techniques to do it. Since you want the techniques, I assume you assume that they exist.

> 1. Every time I've confidently thought "AI will never be able to do X" in the last year, I've later been proven wrong, so I'm a bit wary of assuming that again without strong reasons.

That's evidence that you shouldn't assume something is impossible. I'm not suggesting that, either.

> 2. I see blog posts by some of the most AI-forward people that seem to imply some people are already managing large codebases without human review of raw code. Maybe they're full of crap - there are certainly plenty of over-credulous bs artists in the AI space - but maybe they're not.

Do you have any idea whether this works well though?

> 3. The returns on figuring this out are so incredibly high that, if it's possible, people will figure it out.

Ok. But again, that's a big if there.

The returns on breaking a popular cryptographic algorithm are also huge, but that's not an indication that it's possible, or that it's impossible for that matter.

I'm baffled why people think that "it would be great if..." has any bearing on the chances that the thing that follows is true.

by gspr

3/7/2026 at 11:57:06 PM

1. Every time I've confidently stated "this AI architecture will never be able to do X" in the past 6 years, I've not been proven wrong (with one possible exception earlier today: https://news.ycombinator.com/item?id=47291893 – the jury's still out on that one). … No, my version doesn't really work, does it? It just sounds like bragging, or maybe hubris.

> some people are already managing large codebases without human review of raw code.

2. I have never believed this to be impossible. I do, however, maintain that these codebases are necessarily some combination of useless, plagiarism, and bloated. I have yet to see a case where there isn't a smaller, cheaper way to accomplish the same task faster and better.

> The returns on figuring this out are so incredibly high

3. And yet, they still haven't figured it out. My bias is that it isn't possible, because nothing has fundamentally changed about the model architectures since I first skimmed a PDF about GPT, and imagined an informal limiting proof that I still haven't found any holes in.

by wizzwizz4

3/7/2026 at 8:03:08 PM

What is this obsession with specifications? For a start it’s certainly not fair to assume an LLM has translated it into correct code, even if there is one reasonable way to do so, and there probably isn’t. I like a good, well-targeted spec as much as anyone, but come on. A spec detailed enough to describe a program is more-or-less the program but written in a non-executable language. I want to review the code, not a spec.

by dwb

3/7/2026 at 8:56:03 PM

I can't help but think that the logical conclusion of spec-first development is a return to Waterfall methodology. The amount of rigour required almost entirely negates the speed advantages of LLMs, even in the hands of seasoned developers. Unless the stakeholders are external, there will always be that necessary organisational bottleneck; of course, the C-suite could always decide to foist project management entirely on individual contributors, or take it on themselves, but I see that ending either in burnout or eventual neglect. All in the service of being on the forefront of adoption, and for what end?

by xantronix

3/7/2026 at 9:35:51 PM

Yeah, agree. Either that or this idea of not reviewing the code at all takes hold, abdicating human engineering responsibility to the machines, until some big stupid disaster or when it’s Too Late.

by dwb

3/8/2026 at 1:59:22 AM

It's the teleological fight. Do SWEs decide what the purpose of the system is, or do non-technical people?

Intention flows are important

by whattheheckheck

3/8/2026 at 8:10:06 AM

I’m not talking about which humans decide the purpose of the system, or even which humans engineer the system once designed at a higher level. I’m worried about leaving crucial decisions and understanding to LLMs, with humans just stepping back.

by dwb

3/7/2026 at 6:36:50 PM

>Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

It's wild that so many of the PRs being zipped around don't even run these. You would run such validations as a human...

by orsorna

3/7/2026 at 8:42:57 PM

> Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

Or we could actually, you know, stop using a tool that doesn't work. People are so desperate to believe in the productivity boosts of AI that they are trying to contort the whole industry around a tool that is bad at its job, rather than going "yeah that tool sucks" and moving on like a sane person would.

by bigstrat2003

3/8/2026 at 2:04:55 AM

Neural nets sucked in the 1960s, and if people had given up then we wouldn't be here.

by whattheheckheck

3/8/2026 at 2:19:04 AM

And the Concorde did not replace normal jet travel.

by habinero

3/7/2026 at 6:38:00 PM

Do you know what happens to every industry when they get too fast and slapdash?

Regulation.

It happened with plumbing. Electricians. Civil engineers. Bridge construction. Haircutting. Emergency response. Legal work. Tech is perhaps the least regulated industry in the world. Cutting someone's hair requires a license, operating a commercial kitchen requires a license, holding the SSNs of 100K people does not yet.

If AI is fast and cheap, some big client will use it in a stupid manner. Tons of people can and will be hurt afterward. Regulation will follow. AI means we can either go faster, or focus on ironing out every last bug with the time saved, and politicians will focus on the latter instead of allowing a mortgage meltdown in the prime credit market. Everyone stays employed while the bar goes higher.

by gjsman-1000

3/7/2026 at 6:56:40 PM

He’s right. Exhibit A is age-gating social media. If the industry keeps being this careless that’s going to be the tip of the iceberg.

by coffeefirst

3/8/2026 at 12:57:15 AM

It's not just going to be software. We will absolutely be experiencing vibe law, vibe medicine, vibe legislation even. It'll be so much vibing that it's not worth saying the word anymore.

by threatofrain

3/7/2026 at 6:52:57 PM

> Regulation will follow.

I would hope so, but it won't happen as long as the billionaire AI bros keep on paying politicians for favorable treatment.

by hackyhacky

3/7/2026 at 7:01:09 PM

The word is "bribing", and the current (bribable) administration won't be around forever (hopefully).

by leptons

3/8/2026 at 11:16:47 AM

At the very least, the EU will regulate, and most other countries will copy. At that point the US will either need to regulate or watch its software exports go to zero, with consequent impacts on its stock markets.

I've been saying this for a while, but within a generation, software will be as regulated as finance.

by disgruntledphd2

3/7/2026 at 11:39:52 PM

Very well said.

I think that "deciding what types of code can be reliably handed off to AI" might be missing from the list. It's orders of magnitude easier to nail 80% all the time than 100% all the time. I could see standalone products even developing in this space.

by px1999

3/7/2026 at 7:38:36 PM

My bet is that the last item is what we’ll end up leaning heavily on - feels like the path of least resistance

Throw in some simulated user interactions in a staging environment with a bunch of agents acting like customers a la StrongDM so you can catch the bugs earlier

by pjm331

3/7/2026 at 7:04:01 PM

I made a distributed operating system that manages all of this. Not just for agents per se but in general allows many devs to work simultaneously without tons of central review and allows them to keep standards high while working independently.

by user3939382

3/7/2026 at 7:05:09 PM

[dead]

by Copyrightest

3/7/2026 at 5:44:54 PM

This still seems like technical debt to me. It's just debt with a much higher compounding interest rate and/or shorter due date. Credit cards vs. traditional loans or mortgages.

>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.

That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck deep in finding out what to build. Some developers like doing that (likely for free) but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.

If you are building POCs (and everyone understands it's a POC), then AI is actually better at getting those built, as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.

Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.

by hnthrow0287345

3/7/2026 at 6:13:35 PM

> AI is actually better getting those built as long as you clean it up afterwards

I've never seen a quick PoC get cleaned up. Not once.

I'm sure it happens sometimes, but it's very rare in the industry. The reality is that a PoC usually becomes "good enough" and gets moved into production with only the most perfunctory of cleanup.

by lowsong

3/7/2026 at 7:01:22 PM

The key to every quick POC having a short life is a reliance on manual work outside of the engineering team.

by gregoryl

3/8/2026 at 9:20:35 AM

I want to know more. Can you give an example? :D

by tpoacher

3/7/2026 at 7:13:16 PM

This is genius.

by dwaltrip

3/7/2026 at 7:26:09 PM

One trick for avoiding this is to use artifacts in the PoC that no self-respecting developer would ever allow in production. I use HTML tables in PoCs because front-end devs hate them, with old-school properties like cellpadding that I know will get replaced.

I also name everything DEMO__ so at least they'll have to go through the exercise of renaming it. Although I've had cases where they don't even do that lol. But at least then you know who's totally worthless.

by suzzer99

3/7/2026 at 6:25:55 PM

There is nothing as permanent as a temporary solution!

by somewhereoutth

3/8/2026 at 9:19:41 AM

Yes. Or, "Prototypes have a bad habit of becoming the product".

by tpoacher

3/7/2026 at 6:53:32 PM

Bad code isn't Technical Debt, it's an unhedged Call Option

If you search for that quote, you'll find the #1 result is an AI-slop paraphrase published last week, but the original article is 11 years old and was republished 3 years ago.

https://higherorderlogic.com/programming/2023/10/06/bad-code...

by gowld

3/8/2026 at 12:16:45 AM

My current stance on reviewing code is: it's not ok to make another human review the code you made with AI. If you used AI, then you're the reviewer, so unless you come to me with a well-defined question or decision to make, just merge it and take responsibility.

Obviously that could only work in a high-trust environment; that's why open source suffers so much with AI submissions.

by SPascareli13

3/9/2026 at 2:14:56 PM

There's a related but distinct problem downstream: once the agent is running in production, verification debt shifts from code to execution. Internal logs of what the agent called and what it received are mutable — if a provider disputes delivery or compliance requires an audit trail, "we have logs" is a weak defense. The deterministic verification (tests, linters, CI) handles the code side. The execution side is a different problem: you need immutable witnesses at call time, before the agent proceeds, not post-hoc reconstructions.
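One minimal shape for an immutable call-time witness is an append-only hash chain, where each entry commits to the previous one so after-the-fact edits are detectable (a generic sketch of the idea, not any particular product's API; the record fields are invented):

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel hash for the first entry

def append_record(chain, record):
    """Append a record whose hash commits to the previous entry,
    making silent after-the-fact edits detectable."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(record, sort_keys=True)  # canonical serialization
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    chain.append({"record": record, "prev": prev, "hash": digest})

def verify(chain):
    """Recompute every link; any mutated record breaks the chain."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain = []
append_record(chain, {"call": "provider.deliver", "status": 200})
append_record(chain, {"call": "provider.confirm", "status": 200})
assert verify(chain)

chain[0]["record"]["status"] = 500  # tamper with history
assert not verify(chain)
```

A real deployment would also anchor the chain head somewhere outside the agent's control (a third party, a transparency log), since an attacker who can rewrite the whole chain can recompute every hash.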

by saltpath

3/7/2026 at 7:11:51 PM

Verification debt has always been present; we just now feel it acutely, because we do it wrong.

Claude and friends represent an increase in coders without any corresponding increase in code reviewers. It's a break in the traditional model of reviewing as much code as you submit, and it all falls on human engineers, typically the most senior.

Well, that model kinda sucked anyway. Humans are fallible, and Ironies of Automation lays bare the failure modes. We all know the signs: 50 comments on a 5-line PR, a lonely "LGTM" on the 5,000-line PR. This is not responsible software engineering or design; it is, as the author puts it, a big green "I'm accountable" button with no force behind it.

It's probably time for all of us on HN to pick up a book or course on TLA+ and elevate the state of software verification. Even if Claude ends up writing TLA+ specs too, at least that will be a smaller, simpler code base to review?

by jldugger

3/8/2026 at 6:03:04 AM

Will the TLA+ spec Claude spits out do what the users actually desire? Will there be human oversight of the spec? If not, I don't see how it really helps if the future human-machine interface is supposed to be loosey-goosey natural language. The best thing I can conceive is some human observers of the system saying "Claude, the behavior as it stands now is perfect! Set it in stone with TLA+." But this whimsical idea has many problems.

by pcloadlett3r

3/8/2026 at 7:23:26 PM

> Will there be human oversight of the spec?

Well that's my suggestion, but I suppose nothing rules out an "all apprentices, no sorcerers" failure mode.

by jldugger

3/8/2026 at 9:28:09 AM

Software is a huge collection of tiny, curated details.

Verifying that they all work can be done in many ways, most of them high-touch - but to me the most effective way is to build a test suite.

And the best way to get a test suite while building is Test-Driven Development (TDD). Its key trait, that you witnessed the tests fail before making them pass, gives you proof they actually prove something about your code. It's a high-leverage way to ensure details are documented and codified in a way that requires "zero tokens at rest". If a test fails, something has been un-built; something has regressed. Conversely, if all tests pass, your agent burned zero tokens learning that.
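The witnessed-failure discipline can be shown in miniature (a hypothetical apply_discount feature, invented for illustration; real projects would use a test runner rather than inline asserts):

```python
# Red phase: write the test first and watch it fail against a stub,
# proving the test can actually detect the feature's absence.
def apply_discount(price, percent):
    raise NotImplementedError  # the feature is "un-built"

def test_discount():
    assert apply_discount(100, 10) == 90

witnessed_failure = False
try:
    test_discount()
except NotImplementedError:
    witnessed_failure = True
assert witnessed_failure  # we saw red before green

# Green phase: implement; the very same test now passes unchanged.
def apply_discount(price, percent):
    return price - price * percent // 100

test_discount()  # passes: the detail is codified and guarded
```

If the test had passed against the stub, it would be proving nothing; witnessing the red run is what gives the green run its meaning.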

The industry will keep inventing other solutions but we have this already, so if you’re in the know, you should use it.

If you’re wondering how to get started you (or your agent) can crib ideas from what I’ve done & open sourced: https://codeleash.dev/docs/tdd-guard/

by cadamsdotcom

3/7/2026 at 6:18:50 PM

Verification has always been hard and always ignored, in software more than other industries. This is not specific to AI generated code.

I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.

by ironman1478

3/7/2026 at 5:31:42 PM

> It gets 50% more pull requests, 50% more documentation, 50% more design proposals

Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased as trunk-based development, to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping too much. :-)

by Kerrick

3/8/2026 at 12:07:17 AM

I've been spending much less time on reviews lately. I used to check whether the code was correct and well written, worked on my local machine as expected, and performed well. But I can't do it anymore. If they can vibe-code, why can't I vibe-review? Maybe something will go wrong in production, but it's not my responsibility. I also stopped volunteering for on-call (well, I shouldn't have in the first place). If I noticed someone reporting a production bug during non-working hours, I would investigate and implement the solution, usually faster than my coworkers. I thought it was my responsibility to contribute to the product if I could, even though it was beyond my job description. Working with AI-generated code has really demoralized me, and I can't love the product I'm working on anymore.

by hamasho

3/7/2026 at 11:01:21 PM

With a CS degree and 15 years of software engineering under my belt, I was initially skeptical of 'vibe coding'. But the article is right about this adolescent phase. I recently built my platform (https://voix.chat) 100% through agentic workflows. Having that much experience meant I didn't use the AI as a crutch to learn how to code; I used it as a hyper-productive junior dev while I played the paranoid senior architect. It allowed me to focus purely on the hard stuff: strict anti-flood mechanisms, brute-force protection, and overall server hardening. The AI handles the syntax; the human handles the paranoia.

by talkvoix

3/7/2026 at 11:43:21 PM

Your FAQ page is in Portuguese, even though my language is set to English. Changing the language does not seem to change the FAQ. Did you forget to localize this?

by NewsaHackO

3/8/2026 at 12:18:25 AM

I just fixed that! Thank you so much for letting me know. I had forgotten to upload that commit, haha!

Could you refresh the page and test it?

by talkvoix

3/8/2026 at 2:25:48 PM

Everyone is circling around this. We are shifting to "code factories" that take user intent in at one end and crank out code at the other end. The big question: can you trust it?

We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.

I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.

In the end, this structure took me from ~73% first-pass to over 90%.
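The stage-boundary pattern might look roughly like this (the specific checks and the easy/hard split are illustrative inventions of mine, not the commenter's actual harness):

```python
# Illustrative staged pipeline: each stage's output passes through
# deterministic checks at the boundary; "easy" findings are fixed
# in-pipeline, "hard" findings are escalated to a human.

def check_trailing_ws(code):
    # Easy class: mechanical cleanup the pipeline can do itself.
    return [] if code == code.rstrip() + "\n" else ["easy:trailing-ws"]

def check_todo(code):
    # Hard class: an unresolved decision a human has to make.
    return ["hard:unresolved-TODO"] if "TODO" in code else []

CHECKS = [check_trailing_ws, check_todo]

def stage_boundary(code, escalations):
    for check in CHECKS:
        for finding in check(code):
            kind, name = finding.split(":", 1)
            if kind == "easy":
                code = code.rstrip() + "\n"   # auto-fix and continue
            else:
                escalations.append(name)      # surface to a human
    return code

escalations = []
out = stage_boundary("x = 1   \n\n", escalations)
assert out == "x = 1\n" and escalations == []

out = stage_boundary("x = 1  # TODO revisit\n", escalations)
assert escalations == ["unresolved-TODO"]
```

The leverage comes from where the checks sit: catching a pattern at the boundary of the stage that produced it is far cheaper than discovering it after several more stages have built on top.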

by mrothroc

3/7/2026 at 5:54:20 PM

This verification problem is general.

As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.

I prompted it to cite and footnote all sources, avoid plagiarism, and avoid AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!"), but various online checkers did not detect AI writing or plagiarism. I spot-checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a uni I probably could).

Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.

But how do I know Gemini didn't hallucinate the same things Claude did?

Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.

by johngossman

3/7/2026 at 6:20:41 PM

I used gemini to look up a relative with a connection to a famous event. The relative himself is obscure, but I have some of his writings and I've heard his story from other relatives. Gemini fabricated a completely false narrative about my relative that was much more exciting than what actually happened. I spent a bunch of time looking at the sources that Gemini supplied trying to verify things and although the sources were real, the story Gemini came up with was completely made up.

by apical_dendrite

3/7/2026 at 6:24:58 PM

Yup. I've had Gemini create fake citations to papers. I've also had it hallucinate the contents of paywalled papers, so I know I can't trust anything it writes, though I am getting better at using it recursively to verify things.

by johngossman

3/7/2026 at 7:11:16 PM

I am certain I read an article posted on HN a month or so ago about some researchers who were caught using false citations in their research.

If I remember correctly, some group used an AI tool to sniff for AI citations in others' works. What I remember most was how abhorrent some of the sources the AI sniffer caught were. One of the citations' authors was literally cited as "FirstName LastName" -- they didn't even sub in a fake name lol.

Edit: I found the OP:

https://news.ycombinator.com/item?id=46720395

by hirvi74

3/7/2026 at 7:00:11 PM

I believe that, on a fundamental level, the principle of 'trust, but verify' can be followed to its logical endpoint, as covered in Ken Thompson's lecture, 'Reflections on Trusting Trust' [1]. At some point, one simply has to trust that something is correct, unless they have the capability to verify every step of a long chain of indirection.

So, in regard to your book: Claude may or may not have hallucinated the information from its cited sources. Gemini, as well. However, say you had access to the cited information behind a paywall. How would you go about verifying the information cited in those sources was correct?

Since the release of LLMs over the past four years or so, I have noticed a trend where people are (rightfully) hesitant to trust the output of LLMs. But if the knowledge is in a book or comes from some other man-made source, it's somehow infallible? Such thinking reminds me of my primary schooling days. Teachers would not let us use Wikipedia as a source because, "Anyone can edit anything." Though, it's not as if one cannot write anything they want in a book -- be it true or false?

How many scientific researchers have p-hacked their research, falsified data, or used other methods of deceit? I do not believe it's truly an issue on a grand scale, nor does it make vast amounts of science illegitimate. When caught, the punishments are usually handled in a serious manner, but there's no telling how much falsified research was never caught.

I do believe any and all information provided by LLMs should be verified and not blindly trusted; however, I extend that same policy to works from my fellow humans. Of course, no one has the time to verify every single detail of every bit of information one comes across. Hence, at some point, we all must settle on trusting in trust. Knowledge that we cannot verify is not knowledge. It is faith.

[1] https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

by hirvi74

3/7/2026 at 7:55:23 PM

This is great; your final line summarizes my thoughts as well. When it comes to matters of faith, your average Redditor or Hacker News commenter will heap scorn and derision on religious people for accepting things blindly without any proof, yet they will blindly accept what other people tell them is true -- or now, what an LLM says is true.

by regus

3/8/2026 at 2:49:39 AM

> But if the knowledge is in a book or comes from another other man-made source, it's some how infallible

Nobody who's ever done research believes that. Everything gets put along a spectrum of trust/accuracy.

You should be able to say what you believe, what you base that belief on, what it would take to disprove that belief, and how likely you think it is to be disproven.

That's why you do research from as many primary sources as you can, because yeah, otherwise you're reading someone else's interpretation. Sometimes you can't do that (you don't read the language, etc) and then you have to judge the quality of the interpretation.

It's an enormous amount of work to write a book, and making things up doesn't make that process a whole lot easier. So most people try to be accurate. Especially with editors and such doublechecking work. I still always judge the quality of the work as I'm reading it.

LLMs just flat out can't be trusted. They're endless fountains of words and aren't accurate by nature. They're fine if you already know the answer, and not fine if you don't.

by habinero

3/7/2026 at 6:46:39 PM

Before AI, the smartest human still had to pass the paywall to access paywalled content.

AI has exacerbated the Internet's "content must be free or else does not exist" trend.

It's just not interesting to challenge an AI to write professional research content without giving it access to research content. Without access, it's just going to paraphrase what's already available.

by gowld

3/7/2026 at 6:17:35 PM

Verification is the bottleneck now, so we have to adjust our tooling and processes to make verification as easy as possible.

When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy as possible to verify. Break it into palatable chunks. Document and comment to aid verification. Add tests that are easy for the reviewer to read, test, and tweak. Etc.

by bryanlarsen

3/7/2026 at 6:43:11 PM

Just prompt the AI to verify the software.

by gowld

3/7/2026 at 6:01:43 PM

My company recently hired a contractor. He submits multi-thousand-line PRs every day, far faster than I can review them. This would maybe be OK if I could trust his output, but I can't. When I ask him really basic questions about the system, he either doesn't know or he gets it wrong.

This week, I asked for some simple scripts that would let someone load data in a local or staging environment, so that the system could be tested in various configurations. He submitted a PR with 3800 lines of shell scripts. We do not have any significant shell scripts anywhere else in our codebase. I spent several hours reviewing it with him - maybe more time than he spent writing it. His PR had tons and tons of end-to-end tests of the system that didn't actually test anything - some said they were validating state, but passed if a GET request returned a 200. There were a few tests that called a create API. The tests would pass if the API returned an ID of the created object. But they would ALSO pass if the API didn't return an ID.

I was trying to be a good teacher, so I kept asking questions like "why did you make this decision", etc., to try to have a conversation about the design choices, and it was very clear that he was just making up bullshit rationalizations - he hadn't made any decisions at all. There was one particularly nonsensical test suite - it said it was testing X but included API calls that had nothing to do with X. I was trying to figure out how he had come up with that, and then I realized: I had given him a Postman export with some example API requests, and in one of the API requests I had gotten lazy and modified the request to test something but hadn't modified the name in Postman. So the LLM had assumed that the request was related to the old name and used it when generating a test suite, even though these things had nothing to do with each other. He had probably never actually read the output, so he had no idea that it made no sense.

When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.

by apical_dendrite

3/7/2026 at 6:50:59 PM

I expect you'll be seen as the problem for slowing an obviously productive person down. What a time to be alive :(

by metajack

3/7/2026 at 6:48:13 PM

Why are you paying someone who isn't doing the job you hired someone to do?

Why are you acting like you work for the contractor, instead of the contractor working for you?

Why are you teaching a contractor anything? That's a violation of labor law. You are treating a contractor like an employee.

by gowld

3/7/2026 at 6:55:08 PM

Excellent questions.

by apical_dendrite

3/7/2026 at 11:48:07 PM

I'm observing pretty much the same pattern in my job. The sad truth is, people -- especially non-technical people -- get too easily impressed by vibe-coded projects or contributions made in a few hours, because it's shiny and it gives the impression of a productivity boost. Don't you dare ask how that is supposed to scale, or whether it's secure or even extensible, or you'll be the one killing the mood in the room. Even though that's precisely the hard part of the job.

by axi0m

3/7/2026 at 6:27:15 PM

Do you think he used AI to generate that much code without ever understanding or having a look at the code? Why was he hired?

by lpnam0201

3/7/2026 at 6:31:05 PM

Yes, because he can't answer basic questions about the code.

He was hired because we needed a contractor quickly and he and his company represented to us that he was a lot more experienced than he actually is.

by apical_dendrite

3/7/2026 at 6:46:11 PM

Will you get rid of him? It sounds like he's wasting a lot of your time

by afro88

3/7/2026 at 7:17:41 PM

Or... is apical_dendrite just circling the wagons, scared of AI taking his job?

/management thoughts

by suzzer99

3/7/2026 at 6:38:21 PM

This sums up the inherent friction between hype and reality really well.

CEOs and hype men want you to believe that LLMs can replace everyone. In 6 months you can give them the keys to the kingdom and they'll do a better job running your company than you did. No more devs. No more QA. No more pesky employees who need crazy stuff like sleep, and food, and time off to be human.

Then of course we run face first into reality. You give the tool to an idiot (or a generally well-meaning person not paying enough attention) and you end up with 2k-line PRs that are batshit insane, production databases deleted, malicious code downloaded and executed on your machines, email archives deleted, and entire production infrastructure systems blown away. Then the hype men come back around and go "well yeah, it's not the tool's fault, you still need an expert at the wheel" -- even though you were told you don't.

LLMs can do amazing things, and I think there's a lot of opportunities to improve software products if used correctly, but reality does not line up with the hype, and it never will

by scuff3d

3/7/2026 at 6:49:21 PM

> CEOs and hype men want you to believe that LLMs can replace everyone.

> they'll do a better job running your company

SWEs aren't the ones running the company.

CEOs are.

by gowld

3/7/2026 at 7:30:56 PM

I was being hyperbolic to make a point. Not literal.

by scuff3d

3/8/2026 at 4:24:03 PM

> Mostly I’m just there to press the big “I’m accountable” button on the screen

This is going to be way harder now vs. when we used to write the code ourselves. In the contracting space, the problem now is that you may have a client who vibe-coded an app and is very out of touch about the costs involved in having a developer approve it. It's going to be a hard sell when the client built the entire thing themselves and you are a mere peasant doing QA review.

by devld

3/7/2026 at 6:46:27 PM

It comes down to trust. I was not able to trust GPT 4.1 or Sonnet 3.5 with anything other than short, well-specified tasks. If I let them go too long (e.g. in long Cursor sessions), they would lose the plot and start thrashing.

With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.

I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.

Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.

by bensyverson

3/7/2026 at 10:07:13 PM

good moooorning sir

by cuntiusmccunt

3/8/2026 at 1:41:57 PM

I've noticed something similar using AI for coding. Writing the code becomes faster, but checking what it actually does takes longer than expected. Sometimes you spend more time reading and testing the generated code than writing it yourself. The speed is real, but the verification part doesn’t go away.

by veloryn

3/7/2026 at 7:58:31 PM

Both empirically and theoretically, verification is often much more tractable than discovery.

Software development is a highly complex task and verification becomes not just validation of the output but also verification that the work is solving the problem desired, not just the problem specified.

I'm empathetic to that scenario, but this was a problem with software development to begin with. I would much rather be in a situation of reducing friction to verification than reducing friction to discovery.

Cognitive load might be the same but now we get a potential boost in productivity for the same cost.

by abetusk

3/7/2026 at 6:36:19 PM

Historically, the cycle has been requirements -> code -> test, but with coding becoming much faster, the bottlenecks have changed. That's one of the reasons I've been working on Spark Runner to help automate testing for web apps: https://github.com/simonarthur/spark-runner

by chromaton

3/7/2026 at 5:39:34 PM

Code is now a fully disposable way to generate custom logic.

Hand-crafted, scalable code will be a very rare phenomenon.

There will be a clear distinction between the two.

by maxdo

3/7/2026 at 10:22:45 PM

> Output is mind-numbingly verbose. You ask for a focused change and get a dissertation with unsolicited comments and gratuitous refactoring.

Recent Devstral 2 (Mistral) is pretty precise and concise in its changes.

by mentalgear

3/7/2026 at 8:30:10 PM

At the end of the day, it's about liability. Whether you use AI tools to generate the code or not, you are the author of the code, and such authorship implies the liability that you are being paid to take.

by ritcgab

3/7/2026 at 10:09:31 PM

We were verifying code before? And wouldn't AI help with verification, at least for the trivial flaws?

by poemxo

3/7/2026 at 6:00:55 PM

I've come to the point where I think generated code is nothing better than a random package I install. Did I read it all, or just accept what was promised? Can it bite me in the butt somewhere down the road? Probably. But I currently have at least more doubt about the generated code than about a random package I picked up somewhere on git, whose readme I only partly skimmed.

by VanTodi

3/7/2026 at 6:30:03 PM

However, a random [but well established] package will have been used many, many times, and thus will have been verified in the wild, and likely will have a bug tracker, updates, and perhaps even a community of people who care about that particular code. No comparison, really.

by somewhereoutth

3/8/2026 at 7:23:43 PM

[dead]

by irenetusuq

3/7/2026 at 6:33:28 PM

[dead]

by aplomb1026

3/7/2026 at 10:03:47 PM

[dead]

by ClaudioAnthrop

3/7/2026 at 11:13:02 PM

[flagged]

by decker_dev