3/4/2026 at 1:59:24 AM
> The Claude C Compiler illustrates the other side: it optimizes for passing tests, not for correctness. It hard-codes values to satisfy the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
by roadbuster
3/4/2026 at 4:53:15 AM
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
by WhyNotHugo
3/4/2026 at 6:22:37 AM
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only once it settles down on something that I'm confident will go to production.
by pmontra
3/4/2026 at 8:02:11 AM
Where I'm at currently (which may change) is that I lay down the foundation of the program and its initial tests first. That initial bit is completely manual. Then, when I'm happy that the program is sufficiently "built up", I let the LLM go crazy. I still audit the tests, though personally auditing tests is the part of programming I like the very least. This also largely preserves the initial architectural patterns that I set, so it's just much easier to read LLM code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up, though, I will be flexible on some of my ideas.
by Karrot_Kream
3/4/2026 at 4:23:39 PM
> When fixing bugs, yes.
One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.
In our own codebase we saw that regression on fixed bugs is very low, so writing a specific test for each one isn't the best way to spend your resources. Writing a broad test, when possible, is.
Not sure how LLMs handle that case when coming up with a proper test.
by koonsolo
3/5/2026 at 9:23:40 PM
[dead]
by boristane007
3/4/2026 at 12:54:30 PM
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
by pipecmd
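As a sketch of what that interface-only input looks like in practice (the `apply_discount` function and its contract are hypothetical, invented for illustration): the test-writing agent receives only the signature and docstring, and its tests encode that contract rather than whatever the implementation happens to do.

```python
# The contract the test-writing agent is allowed to see: signature + docstring.
# (apply_discount and its behavior are hypothetical, for illustration only.)

def apply_discount(price_cents: int, percent: float) -> int:
    """Return price_cents reduced by percent (0-100), rounded down.

    Raises ValueError if percent is outside [0, 100]."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return int(price_cents * (100 - percent) / 100)

# Tests derived from the docstring alone -- they pin down the contract,
# not the implementation's incidental behavior.
assert apply_discount(1000, 0) == 1000
assert apply_discount(1000, 10) == 900
assert apply_discount(999, 50) == 499   # rounded down, per the contract
try:
    apply_discount(1000, 150)
    assert False, "expected ValueError"
except ValueError:
    pass
```

If the implementation later drifts from the docstring, these tests fail; tests generated from the implementation would simply drift along with it.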
3/4/2026 at 6:02:12 AM
Agreed 1000%. But that can be a lot of work; creating a good set of tests is nearly as much effort as implementing the thing being tested, often even more.
When LLMs can assist with writing useful tests before having seen any implementation, then I'll be properly impressed.
by usefulcat
3/4/2026 at 6:33:55 AM
From experience, AI is bad at TDD. It can infer tests based on written code, but it is bad at writing generalized tests unless a clear requirement is given, so you, the engineer, are doing most of the work anyway.
by byzantinegene
3/4/2026 at 4:42:20 PM
My day job has me working on code that is split between two different programming languages. I'd say LLMs are pretty good at TDD in one of those languages and a hot mess in the other.
Which, funny enough, is a pretty good reflection of how I thought of the people writing in those languages before LLMs: one considers testing a complete afterthought, and in the wild it is rare to find tests at all; when they are present they often aren't good. Whereas the other brings testing as a first-class feature, and most codebases I've seen generally contain fairly decent tests.
No doubt LLM training has picked up on that.
by 9rx
3/4/2026 at 1:22:26 PM
I don't think it addresses the problem.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing code, so as to make sure you code the specs and not the tests that test the specs, as something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean it is correct.
by GuB-42
3/4/2026 at 1:57:25 PM
Tests are your spec. You write them first because that is the stage when you are still figuring out what you need to write. Although TDD says that you should only write one test before implementing it, encouraging spec writing to be an iterative process.
Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first: then the nuance is captured up front, as it comes to mind.
by 9rx
3/4/2026 at 3:00:04 PM
Tests are not any more or any less of a spec than the code. If you are implementing an HTTP server, for instance, RFC 7231 is your spec, not your tests, not your code.
I would say that which comes first between specs and code depends on the context. If you are implementing a standard, the specs of the standard obviously come first, but if you are iterating, maybe on a user interface, it can make sense to start with the code so that you can have working prototypes. You can then write formal documents and tests later, when you are done prototyping, for regression control.
But I think that leaning on tests is not always a good idea. For example, let's continue with the HTTP server. You write a test suite, but there is a bug in your tests, I don't know, you confuse error 404 and 403. Then you write your code, correctly, run the tests, and see that one of your tests fails, telling you that you returned 404 and not 403. You don't think much, after all "the tests are the specs", and change the code. Congratulations, you are now making sure your code is wrong.
Of course, the opposite can and does happen: writing the code wrong and making the tests pass without thinking about what you are actually testing. I believe that's why people came up with the idea of TDD, but for me, test-first flips the problem without solving it. I'd say the only advantage, if it is one, is that it prevents taking a shortcut and releasing untested code, by moving tests out of the critical path.
But outside of that, I'd rather focus on the code, so if something is to be "the spec", that's it. It is the most important artifact, because it is the actual product; everything else is secondary. I don't mean unimportant; I mean that from the point of view of users, it is better for the test suite to be broken than for the code to be broken.
by GuB-42
3/4/2026 at 3:28:54 PM
> RFC 7231 are your specs
It is more like a meta spec. You still have to write a final spec that applies to your particular technical constraints, business needs, etc. RFC 7231 specifies the minimum amount necessary to interface with the world, but an actual program to be deployed into the wild requires much, much more consideration.
And for that, since you have the full picture not available to a meta spec, logically you will write it in a language that both humans and computers can understand. For the best results, that means something like Lean, Rocq, etc. However, in the real world you likely have to deal with middling developers straight out of learn to code bootcamps, so tests are the practical middle ground.
> I don't know, you confuse error 404 and 403.
Just like you would when writing RFC 7231? But that's what the RFC process is for. You don't have to skip the RFC process just because the spec also happens to be machine readable. If you are trying to shortcut the process, then you're going to have this problem no matter what.
But, even when shortcutting the process, it is still worthwhile to have written your spec in a machine-readable format as that means any changes to the spec automatically identify all the places you need to change in implementation.
> writing the code wrong and making passing test without thinking about what you actually testing
The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything. Then, years down the road after everyone has forgotten or moved on, when someone needs to do some refactoring there is no specification to define what the original code was actually supposed to do. Writing the test first means that you have proven that it can fail. That's not the only reason TDD suggests writing a test first, but it is certainly one of them.
> It is the most important, because it is the actual product
Nah. The specification is the actual product; it is what lives for the lifetime of the product. It defines the contract with the user. Implementation is throwaway. You can change the implementation code all day long and as long as the user contract remains satisfied the visible product will remain exactly the same.
by 9rx
3/4/2026 at 10:35:37 PM
> The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything.
What I usually do to prevent this situation is to write a passing test, then modify the code to make it fail, then revert the change. It also gives an occasion to read the code again, kind of like a review.
I have never seen this practice formalized though; good for me, this is the kind of thing I do because I care, and turning it into a process with Jira and such is a good way to make me stop caring.
by GuB-42
3/5/2026 at 11:18:07 AM
> I have never seen this practice formalized though
Isn't that what is oft known as mutation testing? It is formalized to the point that we have automation to do the mutation for you.
by 9rx
3/5/2026 at 5:59:35 PM
Thank you, I wasn't aware of this. This is the kind of thing I wish people were more aware of, kind of like fuzzing, but for tests.
About fuzzing: I have about 20 years of experience in development and I have never seen fuzzing done as part of a documented process in a project I worked on, not even once. Many people working in validation don't even know that it exists! The only field where fuzzing seems to be mainstream is cybersecurity, and most fuzzing tools are "security oriented", which is nice, but it doesn't mean that security is the only field where it is useful.
Anyways, what I do is a bit different in that it is not random like fuzzing, it is more like reverse-TDD. TDD starts with a failing test, then, you write code to pass the test, and once done, you consider the code to be correct. Here you start with a passing test, then, you write code to fail the test, and once done, you consider the test to be correct.
by GuB-42
3/4/2026 at 2:08:58 PM
Also, if you find after implementation that the spec wasn't specific enough, go ahead and refresh the spec and have the LLM redo the code, from scratch if necessary. Writing code is so cheap right now; it takes a different mindset in general.
by mycall
3/4/2026 at 10:20:35 AM
try this for a UI
by dustingetz
3/4/2026 at 3:55:12 AM
The test generation loop is the real trap. You ask the agent to write code, then ask it to write tests for that code. Of course the tests pass: they're testing what the code does, not what it should do.
We ran into this building a task manager. The PUT endpoint set completed=true but never set the completion timestamp. The agent-written tests all passed because they tested "does it set completed to true", not "does it record when it was completed." 59 tasks in production with null timestamps before a downstream report caught it.
The fix was trivial. The gap in verification wasn't.
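That gap is easy to reproduce in miniature. A hedged sketch (the function names and task shape below are hypothetical, not the actual codebase): the agent-style assertion passes against the buggy implementation, while a test of the intended behavior catches it.

```python
from datetime import datetime, timezone

def complete_task_buggy(task: dict) -> dict:
    # Mirrors the bug described above: sets the flag, forgets the timestamp.
    task["completed"] = True
    return task

def complete_task_fixed(task: dict) -> dict:
    task["completed"] = True
    task["completed_at"] = datetime.now(timezone.utc)
    return task

# The agent-written test: checks what the code does, so it passes either way.
assert complete_task_buggy({"id": 1})["completed"] is True

# The behavioral test: checks what the code should do.
def records_completion_time(complete) -> bool:
    return complete({"id": 1}).get("completed_at") is not None

assert not records_completion_time(complete_task_buggy)  # bug slips past test #1
assert records_completion_time(complete_task_fixed)
```

The first assertion is the kind a test generated from the code tends to produce; the second is the kind that only comes from stating the intended behavior independently.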
by MartyMcBot
3/4/2026 at 4:31:07 AM
Once upon a time people advocated writing tests first…
by tbossanova
3/4/2026 at 4:51:14 AM
Once upon a time 'engineering' in software had some meaning attached to it... No other engineering profession would accept the standards (or rather the lack of them) on which software engineering is running.
by codegladiator
3/4/2026 at 8:02:57 AM
> no other engineering profession would accept the standards (or rather their lack of) on which software engineering is running.
I have bad news for you: they are pushing those "standards" (Agile, ASPICE) in hardware and mechanical engineering too.
The results can already be seen. Testing is expensive, and that is where most of the "savings" get made.
by hulitu
3/4/2026 at 12:39:04 PM
Agile isn't a coding standard or approach.by philipallstar
3/4/2026 at 11:10:43 AM
Once upon a time people were thinking about what they were doing. LLMs absolve people from thinking.
by eithed
3/4/2026 at 11:38:50 AM
Engineers aren't paid to think. They are paid to be replaceable cogs who can be fired the moment they show independent thought.
by harimau777
3/4/2026 at 5:45:03 AM
I don't think that would help. The agent would hard-code the test details into the code.
by 8note
3/4/2026 at 7:24:39 AM
I haven't been able to force the agent to write failing tests yet, although I'm sure it should be possible.
by scotty79
3/4/2026 at 7:44:55 AM
I do that all the time with Claude. What part is not working?
by himlion
3/4/2026 at 9:29:25 AM
I don't really use Anthropic models. But when I tried it with others, they can write tests, but they never confirm that the tests fail before proceeding to write the implementation that makes them pass. Maybe I didn't prompt forcefully enough.
by scotty79
3/4/2026 at 12:39:59 PM
I haven't tried this (yet), but I've heard of people disabling write access to test code while the agent is writing the implementation, and vice versa. I imagine "disabling" could be done via prompting, or with a quick one-liner like: chmod -R a-w ./tests
by catlifeonmars
3/4/2026 at 1:30:22 PM
The magic word is "use red/green testing". That makes it create the tests first, confirm they fail (as they should), then write the code to match.
by theshrike79
3/5/2026 at 9:23:49 PM
[dead]
by boristane007
3/4/2026 at 5:57:19 AM
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000-line files of unit tests and nobody can verify anything.
by porphyra
3/4/2026 at 11:37:08 AM
Exactly! It's frustrating how much developers get blamed for the outcomes of incompetent management.
by harimau777
3/4/2026 at 12:38:45 PM
> everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything
This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.
by philipallstar
3/4/2026 at 10:52:06 PM
> The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it.
I don't understand the value of that much code. What features are worth that much more than stability?
by HWR_14
3/4/2026 at 2:53:11 AM
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let's say you have competent devs, a fair assumption. Maybe it's because they don't have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
by Herring
3/4/2026 at 3:04:08 AM
It's because people will do what they're incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that's what programmers will focus on.
Honestly, I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
by sarchertech
3/4/2026 at 3:09:56 AM
Or, if you say we should slow down, your competence is questioned by others who are going very fast (and likely making mistakes we won't find until later).
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree, probably, but does that mean I'm totally wrong and we should be going this fast?
by shigawire
3/4/2026 at 12:45:45 PM
There is a saying: slow is smooth and smooth is fast.
I have personally outpaced some of my more impatient colleagues by spending extra time up front setting up test harnesses, reading specifications, etcetera. When done judiciously it pays off on time scales of weeks or less.
by catlifeonmars
3/4/2026 at 6:18:10 AM
Oh yeah, let them dig a hole and charge sweet consultant rates to fix it. Then the healing can begin.
by citizenpaul
3/4/2026 at 11:40:23 AM
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
by harimau777
3/4/2026 at 3:18:27 AM
How about a subsequent review where a separate agent analyzes the original issue and the resulting code, and approves it if the code meets the intent of the issue? The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one agent that validates multiple criteria, or separate agents for different review criteria.
by ojo-rojo
3/4/2026 at 4:12:40 AM
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
by g947o
3/4/2026 at 4:44:37 AM
In my experience, an agent with "fresh eyes", i.e., without the context of being told what to write and writing it, does have a different perspective and is able to be more critical. Chatbots tend to take the entire previous conversational history as a sort of canonical truth, so removing it seems to get rid of any bias the agent has towards the decisions that were made while writing the code.
I know I'm psychologizing the agent. I can't explain it in a different way.
by pyridines
3/4/2026 at 6:28:29 AM
Fresh eyes, some context, and another LLM.
The problem is information fatigue from all the agents plus the code itself.
by Foobar8568
3/4/2026 at 6:15:07 AM
I think of it as additive bias, i.e. "don't think about the pink elephant". Not only does this not help LLMs avoid pink elephants; it guarantees that pink-elephant information is now being considered in inference when it was not before.
I fear that thinking about problem solving in this manner to make LLMs work is damaging to critical thinking skills.
by citizenpaul
3/4/2026 at 4:53:35 AM
Aren't human coders also nondeterministic?
Assigning different agents different focuses has worked for me, especially when you task a code-reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent, who will assure you it's "fully tested and production ready".
by hex4def6
3/4/2026 at 11:05:34 AM
Human coders are far more reliable. The only downside is speed, and therefore cost.
by samrus
3/4/2026 at 4:31:44 AM
Probably true.
(Sorry.)
by tbossanova
3/4/2026 at 11:04:33 AM
Slop on slop. Who watches the watchman?
by samrus
3/4/2026 at 8:50:40 AM
How long till the industry discovers TDD?
by yanis_t
3/4/2026 at 5:47:17 AM
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
It's fun having LLMs because it makes it quite clear that a lot of testing has been cargo culting. Did people ever check often that the tests check for anything meaningful?
by 8note
3/4/2026 at 6:26:07 AM
15 years ago, I had testers writing "UI tests" / "user tests" that matched what the software was cranking out. At that time I had just joined to continue on the client side, so I hadn't really worked on anything yet.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was when I had to dive into the code base and cry.
by Foobar8568
3/4/2026 at 2:01:53 PM
Test automation is kind of like a religion. It is comforting to believe that the solution to code is more code.by mattacular
3/4/2026 at 6:23:00 AM
Property testing could've helped.
by taatparya
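For the completion-timestamp bug discussed upthread, a property-based test asserts an invariant over many generated inputs instead of one hard-coded case. Libraries like Hypothesis automate the input generation; this dependency-free sketch (the `complete_task` implementation is hypothetical) hand-rolls it:

```python
import random
from datetime import datetime, timezone

def complete_task(task: dict) -> dict:
    # Hypothetical implementation under test.
    task["completed"] = True
    task["completed_at"] = datetime.now(timezone.utc)
    return task

# Property: for ANY input task, completion sets the flag and records a
# timestamp no earlier than the call -- not just for one hand-picked case.
for _ in range(200):
    task = {"id": random.randint(1, 10**6),
            "title": "".join(random.choices("abc xyz", k=8))}
    before = datetime.now(timezone.utc)
    done = complete_task(task)
    assert done["completed"] is True
    assert done["completed_at"] is not None and done["completed_at"] >= before
```

A buggy version that forgets the timestamp fails this property on the very first generated input, regardless of which inputs the generator happens to pick.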
3/4/2026 at 1:10:17 PM
Building a C compiler should not have this problem. There are probably a million test suites from outside the LLM that it can use to verify correctness.
by Illniyar
3/4/2026 at 8:05:30 AM
I think it boils down to how companies view LLMs and their engineers.
Some companies will do as you say: have (mostly clueless) engineers feed high-level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
by ZaoLahma
3/4/2026 at 11:32:03 AM
Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.
There's also the problem that when stack-rank time comes around each year, no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
by harimau777
3/4/2026 at 11:50:00 AM
You are overvaluing "clean code." Code is code: it either works within spec or it doesn't; or it does, but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in that latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. They can, however, know it's usable, and someone else can check to make sure it doesn't blow something else up, and that is the most important thing.
by DiscourseFan
3/4/2026 at 11:49:56 AM
I like unit tests when I have to modify code that someone wrote years ago, as a basic sanity check.
by msh
3/4/2026 at 5:20:11 AM
Yeah, this is the exact kind of ridiculousness I've noticed as well: everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
by IAmGraydon
3/4/2026 at 8:11:39 AM
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code
This is true for humans too. Tests should not be written or performed by the same person who writes the code.
by mrighele
3/4/2026 at 8:43:45 AM
That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
by nly
3/4/2026 at 9:40:23 AM
> [Reviews] should not be written or performed by the same person that writes the code
> That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
by missingdays
3/4/2026 at 11:41:35 AM
Agreed, but then companies shouldn't complain about the consequences of understaffing their teams.
by harimau777
3/5/2026 at 8:03:21 PM
Mwahahahahaha! Suffer, devs, SUFFER! KNOW MY PAIN!
Ahem... Welcome to the wonderful world of Quality Assurance, software-developing audience. That part of the job, after you yeet your code over the fence, where the work is to bridge the gap between your madness and the madness of the rest of the business. Here you will find: frustration; an ever-present sense that the rest of the world is just out to make your life more difficult; a creeping sense of despair; a hot ice pick in the back of your mind every time the language model does something syntactically valid but completely nonsensical in the real world; the development of an ever-increasing time horizon over which you can accurately predict the future, though no one will believe you anyway; a smoldering hatred of the overly confident executive with an overdeveloped capacity for risk tolerance; a desire to run away and start a farm; and finally, a fundamental distrust of everything software, and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
by salawat
3/4/2026 at 1:24:02 PM
My only hope is that all of this push leads, in the end, to the adoption of more formal verification languages and tools.
If people have to specify things in TLA+ etc. -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
by cmrdporcupine
3/4/2026 at 11:26:44 AM
A long time ago in France, the mainstream view among computer people was that code or compute weren't what's important when dealing with computers; it is information that matters and how you process it in a sensible way (hence the name of computer science in French: informatique. And also the name for computer: "ordinateur", literally: that which sets things into order).
As a result, computer students were lectured a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
by littlestymaar
3/4/2026 at 11:53:14 AM
"Computer science is no more about computers than astronomy is about telescopes." - Dijkstra
by msh
3/4/2026 at 7:23:36 AM
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping tests in a separate module, and not letting the agent see the source while writing tests nor the tests while writing the implementation, would help? They could just share the API and the spec.
And in case of failing tests, another agent with full context could decide whether the fix should be delegated to the coding agent or to the testing agent.
by scotty79
3/4/2026 at 4:06:50 AM
This hits hard. I'm getting hit with so much slop at work that I've quietly stopped being all that careful with reviews.
by bentobean
3/4/2026 at 6:47:20 AM
> LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
You can use spec-driven development and TDD. Write the tests first, watch them fail, then modify the code to pass the tests.
by DeathArrow
3/4/2026 at 4:30:19 AM
Um, you're supposed to write the tests first. The agents can't do this?
by SoftTalker
3/4/2026 at 7:45:00 AM
Actually, they are extremely bad at that. All training data contains code + tests, even if the tests were created first. So far, all models that I tried failed to implement tests for interfaces without access to the actual code.
by alexsmirnov
3/4/2026 at 4:38:14 AM
They can, but they should be explicitly told to do that. Otherwise they just do everything in batches. Anyway, pure TDD or not, tests catch only what you tell the AI to write. AI does not know what is right; it does what you told it to do. The above problem wouldn't be solved by pure TDD.
by daliusd
3/4/2026 at 10:38:23 AM
> I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens
I can't wait. Maybe when shitty vibe-coded software starts to cause real pain for people we can return to some sensible software engineering.
I'm not holding my breath though
by bluefirebrand