3/4/2026 at 1:59:24 AM
> The Claude C Compiler illustrates the other side: it optimizes for passing tests, not for correctness. It hard-codes values to satisfy the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
by roadbuster
3/4/2026 at 4:53:15 AM
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
by WhyNotHugo
3/4/2026 at 6:22:37 AM
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only once it settles down on something that I'm confident will go to production.
by pmontra
3/4/2026 at 8:02:11 AM
Where I'm at currently (which may change) is that I lay down the foundation of the program and its initial tests first. That initial bit is completely manual. Then, when I'm happy that the program is sufficiently "built up", I let the LLM go crazy. I still audit the tests, though personally auditing tests is the part of programming I like the very least. This also largely preserves the initial architectural patterns that I set, so it's just much easier to read LLM code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up, though, I will be flexible on some of my ideas.
by Karrot_Kream
3/4/2026 at 4:23:39 PM
> When fixing bugs, yes.
One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.
In our own codebase we saw that regression on fixed bugs is very low, so writing a specific test for each one isn't the best way to spend your resources. Writing a broad test, when possible, is.
Not sure how LLMs handle that case when coming up with a proper test.
by koonsolo
3/5/2026 at 9:23:40 PM
[dead]
by boristane007
3/4/2026 at 12:54:30 PM
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
by pipecmd
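As a sketch of what that interface-only input looks like in practice (the `apply_discount` function and its contract are hypothetical, invented for illustration): the test-writing agent receives only the signature and docstring, and its tests encode that contract rather than whatever the implementation happens to do.

```python
# The contract the test-writing agent is allowed to see: signature + docstring.
# (apply_discount and its behavior are hypothetical, for illustration only.)

def apply_discount(price_cents: int, percent: float) -> int:
    """Return price_cents reduced by percent (0-100), rounded down.

    Raises ValueError if percent is outside [0, 100]."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return int(price_cents * (100 - percent) / 100)

# Tests derived from the docstring alone -- they pin down the contract,
# not the implementation's incidental behavior.
assert apply_discount(1000, 0) == 1000
assert apply_discount(1000, 10) == 900
assert apply_discount(999, 50) == 499   # rounded down, per the contract
try:
    apply_discount(1000, 150)
    assert False, "expected ValueError"
except ValueError:
    pass
```

If the implementation later drifts from the docstring, these tests fail; tests generated from the implementation would simply drift along with it.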
3/4/2026 at 6:02:12 AM
Agreed 1000%. But that can be a lot of work; creating a good set of tests is nearly as much effort as implementing the thing being tested, often even more.
When LLMs can assist with writing useful tests before having seen any implementation, then I'll be properly impressed.
by usefulcat
3/4/2026 at 6:33:55 AM
From experience, AI is bad at TDD. It can infer tests based on written code, but it is bad at writing generalized tests unless a clear requirement is given, so you, the engineer, are doing most of the work anyway.
by byzantinegene
3/4/2026 at 4:42:20 PM
My day job has me working on code that is split between two different programming languages. I'd say LLMs are pretty good at TDD in one of those languages and a hot mess in the other.
Which, funny enough, is a pretty good reflection of how I thought of the people writing in those languages before LLMs: one considers testing a complete afterthought, and in the wild it is rare to find tests at all; when they are present they often aren't good. Whereas the other brings testing as a first-class feature, and most codebases I've seen generally contain fairly decent tests.
No doubt LLM training has picked up on that.
by 9rx
3/4/2026 at 1:22:26 PM
I don't think it addresses the problem.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing code, so as to make sure you code the specs and not the tests that test the specs, as something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean it is correct.
by GuB-42
3/4/2026 at 1:57:25 PM
Tests are your spec. You write them first because that is the stage when you are still figuring out what you need to write. Although TDD says that you should only write one test before implementing it, encouraging spec writing to be an iterative process.
Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first: then the nuance is captured up front, as it comes to mind.
by 9rx
3/4/2026 at 3:00:04 PM
Tests are not any more or any less of a spec than the code. If you are implementing an HTTP server, for instance, RFC 7231 is your spec, not your tests, not your code.
I would say that which comes first between specs and code depends on the context. If you are implementing a standard, the specs of the standard obviously come first, but if you are iterating, maybe on a user interface, it can make sense to start with the code so that you can have working prototypes. You can then write formal documents and tests later, when you are done prototyping, for regression control.
But I think that leaning on tests is not always a good idea. For example, let's continue with the HTTP server. You write a test suite, but there is a bug in your tests, I don't know, you confuse error 404 and 403. Then you write your code, correctly, run the tests, and see that one of your tests fails, telling you that you returned 404 and not 403. You don't think much, after all "the tests are the specs", and change the code. Congratulations, you are now making sure your code is wrong.
Of course, the opposite can and does happen: writing the code wrong and making the tests pass without thinking about what you are actually testing. I believe that's why people came up with the idea of TDD, but for me, test-first flips the problem without solving it. I'd say the only advantage, if it is one, is that it prevents taking a shortcut and releasing untested code, by moving tests out of the critical path.
But outside of that, I'd rather focus on the code, so if something is to be "the spec", that's it. It is the most important artifact, because it is the actual product; everything else is secondary. I don't mean unimportant; I mean that from the point of view of users, it is better for the test suite to be broken than for the code to be broken.
by GuB-42
3/4/2026 at 3:28:54 PM
> RFC 7231 are your specs
It is more like a meta spec. You still have to write a final spec that applies to your particular technical constraints, business needs, etc. RFC 7231 specifies the minimum amount necessary to interface with the world, but an actual program to be deployed into the wild requires much, much more consideration.
And for that, since you have the full picture not available to a meta spec, logically you will write it in a language that both humans and computers can understand. For the best results, that means something like Lean, Rocq, etc. However, in the real world you likely have to deal with middling developers straight out of learn to code bootcamps, so tests are the practical middle ground.
> I don't know, you confuse error 404 and 403.
Just like you would when writing RFC 7231? But that's what the RFC process is for. You don't have to skip the RFC process just because the spec also happens to be machine readable. If you are trying to shortcut the process, then you're going to have this problem no matter what.
But, even when shortcutting the process, it is still worthwhile to have written your spec in a machine-readable format as that means any changes to the spec automatically identify all the places you need to change in implementation.
> writing the code wrong and making passing test without thinking about what you actually testing
The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything. Then, years down the road after everyone has forgotten or moved on, when someone needs to do some refactoring there is no specification to define what the original code was actually supposed to do. Writing the test first means that you have proven that it can fail. That's not the only reason TDD suggests writing a test first, but it is certainly one of them.
> It is the most important, because it is the actual product
Nah. The specification is the actual product; it is what lives for the lifetime of the product. It defines the contract with the user. Implementation is throwaway. You can change the implementation code all day long and as long as the user contract remains satisfied the visible product will remain exactly the same.
by 9rx
3/4/2026 at 10:35:37 PM
> The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything.
What I usually do to prevent this situation is to write a passing test, then modify the code to make it fail, then revert the change. It also gives an occasion to read the code again, kind of like a review.
I have never seen this practice formalized though; good for me, this is the kind of thing I do because I care, and turning it into a process with Jira and such is a good way to make me stop caring.
by GuB-42
3/5/2026 at 11:18:07 AM
> I have never seen this practice formalized though
Isn't that what is oft known as mutation testing? It is formalized to the point that we have automation to do the mutation for you.
by 9rx
3/5/2026 at 5:59:35 PM
Thank you, I wasn't aware of this. This is the kind of thing I wish people were more aware of, kind of like fuzzing, but for tests.
About fuzzing: I have about 20 years of experience in development and I have never seen fuzzing done as part of a documented process in a project I worked on, not even once. Many people working in validation don't even know that it exists! The only field where fuzzing seems to be mainstream is cybersecurity, and most fuzzing tools are "security oriented", which is nice, but it doesn't mean that security is the only field where it is useful.
Anyways, what I do is a bit different in that it is not random like fuzzing, it is more like reverse-TDD. TDD starts with a failing test, then, you write code to pass the test, and once done, you consider the code to be correct. Here you start with a passing test, then, you write code to fail the test, and once done, you consider the test to be correct.
by GuB-42
3/4/2026 at 2:08:58 PM
Also, if you find after implementation that the spec wasn't specific enough, go ahead and refresh the spec and have the LLM redo the code, from scratch if necessary. Writing code is so cheap right now; it takes a different mindset in general.
by mycall
3/4/2026 at 10:20:35 AM
try this for a UI
by dustingetz
3/4/2026 at 3:55:12 AM
The test generation loop is the real trap. You ask the agent to write code, then ask it to write tests for that code. Of course the tests pass: they're testing what the code does, not what it should do.
We ran into this building a task manager. The PUT endpoint set completed=true but never set the completion timestamp. The agent-written tests all passed because they tested "does it set completed to true", not "does it record when it was completed." 59 tasks in production with null timestamps before a downstream report caught it.
The fix was trivial. The gap in verification wasn't.
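That gap is easy to reproduce in miniature. A hedged sketch (the function names and task shape below are hypothetical, not the actual codebase): the agent-style assertion passes against the buggy implementation, while a test of the intended behavior catches it.

```python
from datetime import datetime, timezone

def complete_task_buggy(task: dict) -> dict:
    # Mirrors the bug described above: sets the flag, forgets the timestamp.
    task["completed"] = True
    return task

def complete_task_fixed(task: dict) -> dict:
    task["completed"] = True
    task["completed_at"] = datetime.now(timezone.utc)
    return task

# The agent-written test: checks what the code does, so it passes either way.
assert complete_task_buggy({"id": 1})["completed"] is True

# The behavioral test: checks what the code should do.
def records_completion_time(complete) -> bool:
    return complete({"id": 1}).get("completed_at") is not None

assert not records_completion_time(complete_task_buggy)  # bug slips past test #1
assert records_completion_time(complete_task_fixed)
```

The first assertion is the kind a test generated from the code tends to produce; the second is the kind that only comes from stating the intended behavior independently.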
by MartyMcBot
3/4/2026 at 4:31:07 AM
Once upon a time people advocated writing tests first…
by tbossanova
3/4/2026 at 4:51:14 AM
Once upon a time 'engineering' in software had some meaning attached to it... No other engineering profession would accept the standards (or rather the lack of them) on which software engineering is running.
by codegladiator
3/4/2026 at 8:02:57 AM
> no other engineering profession would accept the standards (or rather their lack of) on which software engineering is running.
I have bad news for you: they are pushing those "standards" (Agile, ASPICE) in hardware and mechanical engineering too.
The results can already be seen. Testing is expensive, and that is where most of the "savings" get made.
by hulitu
3/4/2026 at 12:39:04 PM
Agile isn't a coding standard or approach.by philipallstar
3/4/2026 at 11:10:43 AM
Once upon a time people were thinking about what they were doing. LLMs absolve people from thinking.
by eithed
3/4/2026 at 11:38:50 AM
Engineers aren't paid to think. They are paid to be replaceable cogs who can be fired the moment they show independent thought.
by harimau777
3/4/2026 at 5:45:03 AM
I don't think that would help. The agent would hard-code the test details into the code.
by 8note
3/4/2026 at 7:24:39 AM
I haven't been able to force the agent to write failing tests yet, although I'm sure it should be possible.
by scotty79
3/4/2026 at 7:44:55 AM
I do that all the time with Claude. What part is not working?
by himlion
3/4/2026 at 9:29:25 AM
I don't really use Anthropic models. But when I tried it with others, they can write tests, but they never confirm that the tests fail before proceeding to write the implementation that makes them pass. Maybe I didn't prompt forcefully enough.
by scotty79
3/4/2026 at 12:39:59 PM
I haven't tried this (yet), but I've heard of people disabling write access to test code while the agent is writing the implementation, and vice versa. I imagine "disabling" could be done via prompting, or with a quick one-liner like: chmod -R a-w ./tests
by catlifeonmars
3/4/2026 at 1:30:22 PM
The magic word is "use red/green testing". That makes it create the tests first, confirm they fail (as they should), then write the code to match.
by theshrike79
3/5/2026 at 9:23:49 PM
[dead]
by boristane007
3/4/2026 at 5:57:19 AM
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000-line files of unit tests and nobody can verify anything.
by porphyra
3/4/2026 at 11:37:08 AM
Exactly! It's frustrating how much developers get blamed for the outcomes of incompetent management.
by harimau777
3/4/2026 at 12:38:45 PM
> everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything
This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.
by philipallstar
3/4/2026 at 10:52:06 PM
> The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it.
I don't understand the value of that much code. What features are worth that much more than stability?
by HWR_14
3/4/2026 at 2:53:11 AM
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let's say you have competent devs, a fair assumption. Maybe it's because they don't have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
by Herring
3/4/2026 at 3:04:08 AM
It's because people will do what they're incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that's what programmers will focus on.
Honestly, I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
by sarchertech
3/4/2026 at 3:09:56 AM
Or, if you say we should slow down, your competence is questioned by others who are going very fast (and likely making mistakes we won't find until later).
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree, probably, but does that mean I'm totally wrong and we should be going this fast?
by shigawire
3/4/2026 at 12:45:45 PM
There is a saying: slow is smooth and smooth is fast.
I have personally outpaced some of my more impatient colleagues by spending extra time up front setting up test harnesses, reading specifications, etcetera. When done judiciously it pays off on time scales of weeks or less.
by catlifeonmars
3/4/2026 at 6:18:10 AM
Oh yeah, let them dig a hole and charge sweet consultant rates to fix it. Then the healing can begin.
by citizenpaul
3/4/2026 at 11:40:23 AM
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
by harimau777
3/4/2026 at 3:18:27 AM
How about a subsequent review where a separate agent analyzes the original issue and the resulting code, and approves it if the code meets the intent of the issue? The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one agent that validates multiple criteria, or separate agents for different review criteria.
by ojo-rojo
3/4/2026 at 4:12:40 AM
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
by g947o
3/4/2026 at 4:44:37 AM
In my experience, an agent with "fresh eyes", i.e., without the context of being told what to write and writing it, does have a different perspective and is able to be more critical. Chatbots tend to take the entire previous conversational history as a sort of canonical truth, so removing it seems to get rid of any bias the agent has towards the decisions that were made while writing the code.
I know I'm psychologizing the agent. I can't explain it in a different way.
by pyridines
3/4/2026 at 6:28:29 AM
Fresh eyes, some context, and another LLM.
The problem is information fatigue from all the agents plus the code itself.
by Foobar8568
3/4/2026 at 6:15:07 AM
I think of it as additive bias, i.e. "don't think about the pink elephant". Not only does this not help LLMs avoid pink elephants; it guarantees that pink-elephant information is now being considered in inference when it was not before.
I fear that thinking about problem solving in this manner to make LLMs work is damaging to critical thinking skills.
by citizenpaul
3/4/2026 at 4:53:35 AM
Aren't human coders also nondeterministic?
Assigning different agents different focuses has worked for me, especially when you task a code-reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent, who will assure you it's "fully tested and production ready".
by hex4def6
3/4/2026 at 11:05:34 AM
Human coders are far more reliable. The only downside is speed, and therefore cost.
by samrus
3/4/2026 at 4:31:44 AM
Probably true.
(Sorry.)
by tbossanova
3/4/2026 at 11:04:33 AM
Slop on slop. Who watches the watchman?
by samrus
3/4/2026 at 8:50:40 AM
How long till the industry discovers TDD?
by yanis_t
3/4/2026 at 5:47:17 AM
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
It's fun having LLMs because it makes it quite clear that a lot of testing has been cargo culting. Did people ever check often that the tests check for anything meaningful?
by 8note
3/4/2026 at 6:26:07 AM
15 years ago, I had testers writing "UI tests" / "user tests" that matched what the software was cranking out. At that time I had just joined to continue on the client side, so I hadn't really worked on anything yet.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was when I had to dive into the code base and cry.
by Foobar8568
3/4/2026 at 2:01:53 PM
Test automation is kind of like a religion. It is comforting to believe that the solution to code is more code.by mattacular
3/4/2026 at 6:23:00 AM
Property testing could've helped.
by taatparya
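For the completion-timestamp bug discussed upthread, a property-based test asserts an invariant over many generated inputs instead of one hard-coded case. Libraries like Hypothesis automate the input generation; this dependency-free sketch (the `complete_task` implementation is hypothetical) hand-rolls it:

```python
import random
from datetime import datetime, timezone

def complete_task(task: dict) -> dict:
    # Hypothetical implementation under test.
    task["completed"] = True
    task["completed_at"] = datetime.now(timezone.utc)
    return task

# Property: for ANY input task, completion sets the flag and records a
# timestamp no earlier than the call -- not just for one hand-picked case.
for _ in range(200):
    task = {"id": random.randint(1, 10**6),
            "title": "".join(random.choices("abc xyz", k=8))}
    before = datetime.now(timezone.utc)
    done = complete_task(task)
    assert done["completed"] is True
    assert done["completed_at"] is not None and done["completed_at"] >= before
```

A buggy version that forgets the timestamp fails this property on the very first generated input, regardless of which inputs the generator happens to pick.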
3/4/2026 at 1:10:17 PM
Building a C compiler should not have this problem. There are probably a million test suites from outside the LLM that it can use to verify correctness.
by Illniyar
3/4/2026 at 8:05:30 AM
I think it boils down to how companies view LLMs and their engineers.
Some companies will do as you say: have (mostly clueless) engineers feed high-level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
by ZaoLahma
3/4/2026 at 11:32:03 AM
Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.
There's also the problem that when stack-rank time comes around each year, no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
by harimau777
3/4/2026 at 11:50:00 AM
You are overvaluing "clean code." Code is code: it either works within spec or it doesn't; or it does, but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in that latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. They can, however, know it's usable, and someone else can check to make sure it doesn't blow something else up, and that is the most important thing.
by DiscourseFan
3/4/2026 at 11:49:56 AM
I like unit tests when I have to modify code that someone wrote years ago, as a basic sanity check.
by msh
3/4/2026 at 5:20:11 AM
Yeah, this is the exact kind of ridiculousness I've noticed as well: everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
by IAmGraydon
3/4/2026 at 8:11:39 AM
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code
This is true for humans too. Tests should not be written or performed by the same person who writes the code.
by mrighele
3/4/2026 at 8:43:45 AM
That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
by nly
3/4/2026 at 9:40:23 AM
> [Reviews] should not be written or performed by the same person that writes the code
> That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
by missingdays
3/4/2026 at 11:41:35 AM
Agreed, but then companies shouldn't complain about the consequences of understaffing their teams.
by harimau777
3/5/2026 at 8:03:21 PM
Mwahahahahaha! Suffer, devs, SUFFER! KNOW MY PAIN!
Ahem... Welcome to the wonderful world of Quality Assurance, software-developing audience. That part of the job, after you yeet your code over the fence, where the work is to bridge the gap between your madness and the madness of the rest of the business. Here you will find: frustration; an ever-present sense that the rest of the world is just out to make your life more difficult; a creeping sense of despair; a hot ice pick in the back of your mind every time the language model does something syntactically valid but completely nonsensical in the real world; the development of an ever-increasing time horizon over which you can accurately predict the future, though no one will believe you anyway; a smoldering hatred of the overly confident executive with an overdeveloped capacity for risk tolerance; a desire to run away and start a farm; and finally, a fundamental distrust of everything software, and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
by salawat
3/4/2026 at 1:24:02 PM
My only hope is that all of this push leads, in the end, to the adoption of more formal verification languages and tools.
If people have to specify things in TLA+ etc. -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
by cmrdporcupine
3/4/2026 at 11:26:44 AM
A long time ago in France, the mainstream view among computer people was that code or compute weren't what's important when dealing with computers; it is information that matters and how you process it in a sensible way (hence the name of computer science in French: informatique. And also the name for computer: "ordinateur", literally: that which sets things into order).
As a result, computer students were lectured a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
by littlestymaar
3/4/2026 at 11:53:14 AM
"Computer science is no more about computers than astronomy is about telescopes." - Dijkstra
by msh
3/4/2026 at 7:23:36 AM
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping tests in a separate module, and not letting the agent see the source while writing tests nor the tests while writing the implementation, would help? They could just share the API and the spec.
And in case of failing tests, another agent with full context could decide whether the fix should be delegated to the coding agent or to the testing agent.
by scotty79
3/4/2026 at 4:06:50 AM
This hits hard. I'm getting hit with so much slop at work that I've quietly stopped being all that careful with reviews.
by bentobean
3/4/2026 at 6:47:20 AM
> LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
You can use spec-driven development and TDD. Write the tests first, watch them fail, then modify the code to pass the tests.
by DeathArrow
3/4/2026 at 4:30:19 AM
Um, you're supposed to write the tests first. The agents can't do this?
by SoftTalker
3/4/2026 at 7:45:00 AM
Actually, they are extremely bad at that. All training data contains code + tests, even if the tests were created first. So far, all models that I tried failed to implement tests for interfaces without access to the actual code.
by alexsmirnov
3/4/2026 at 4:38:14 AM
They can, but they should be explicitly told to do that. Otherwise they just do everything in batches. Anyway, pure TDD or not, tests catch only what you tell the AI to write. AI does not know what is right; it does what you told it to do. The above problem wouldn't be solved by pure TDD.
by daliusd
3/4/2026 at 10:38:23 AM
> I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens
I can't wait. Maybe when shitty vibe-coded software starts to cause real pain for people we can return to some sensible software engineering.
I'm not holding my breath though
by bluefirebrand