3/17/2026 at 6:08:18 PM
This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often means understanding those abstractions and testing for and against them. For those of us with decades of experience who use coding agents for hours per day, the lesson is that even with extensive context engineering these models do not magically cover more than about 50% of the testing space.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.
by jryio
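To make the allocator point concrete, here is a minimal sketch in Python: a toy fixed-pool allocator (the class and its API are invented for illustration, standing in for a real allocator) followed by the kind of failure-mode checks an engineer has to think of deliberately — exhaustion, double free, reuse after free — which an agent asked merely to "write an allocator with tests" will not reliably cover unprompted.

```python
# Toy fixed-pool allocator: an illustrative stand-in for a real memory allocator.
class PoolAllocator:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))
        self.live = set()

    def alloc(self):
        if not self.free:
            raise MemoryError("pool exhausted")
        block = self.free.pop()
        self.live.add(block)
        return block

    def dealloc(self, block):
        if block not in self.live:
            raise ValueError("double free or invalid block")
        self.live.remove(block)
        self.free.add(block)

# Failure modes the engineer must enumerate from experience:
a = PoolAllocator(2)
b1, b2 = a.alloc(), a.alloc()

try:                        # 1. exhaustion: allocating from an empty pool must fail
    a.alloc()
    assert False, "expected MemoryError"
except MemoryError:
    pass

a.dealloc(b1)
try:                        # 2. double free must be rejected, not silently ignored
    a.dealloc(b1)
    assert False, "expected ValueError"
except ValueError:
    pass

assert a.alloc() == b1      # 3. a freed block must become reusable
```

The sketch omits real-world failure modes (alignment, fragmentation, concurrent access), which only widens the gap between what an agent generates by default and what the testing space actually contains.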
3/17/2026 at 8:05:37 PM
Exactly. The challenge isn’t getting the LLMs to validate their own code; it’s getting them to write the correct code in the first place. Adding more and more LLM-generated test code just obfuscates the LLM code even further. I have seen some really wild things where an LLM jumps through hoops to get tests to pass, even when they should be failing because the logic is wrong.

The core of the issue is that LLMs are sycophants: they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest-value activity is in the planning phase.
When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.
It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.
by spaceywilly
3/18/2026 at 3:22:24 AM
I'm becoming convinced that test pass rate is not a great indicator of model quality. Instead we have to look at agent behavior beyond the test gate, such as how aligned it is with human intent and whether it follows the repo's coding standards.

I wrote a short blog post about this phenomenon here if you're interested: https://www.stet.sh/blog/both-pass
Also +1 on placing heavy emphasis on the plan. If you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent planning vs. implementing & reviewing.
by bisonbear
3/17/2026 at 10:13:26 PM
The best way I can describe my approach is having the ability to "smell" what the AI might have gotten wrong (or forgotten completely).

It happens all the time, even when I only scan the code or simply run it and use it. It's uncanny how many such "smells" I find even in the most trivial applications. Sometimes the agent's replies in Codex or Claude Code are enough to trigger it.
These are mistakes only a very (very) inexperienced developer would make.
by mvrckhckr
3/18/2026 at 1:47:49 AM
If you wrote a spec for a memory allocator and asked the AI to identify the edge cases and points that need testing first, it could work (I've never asked an AI to do exactly that, but it works for other problems I've tried). But yes, you can't feed in a garbage prompt and context and expect magically good tests to come out of it.
by seanmcdirmid
3/17/2026 at 10:13:07 PM
He’s saying you should write the tests, or at least have the LLM write them, and then carefully review the tests rather than the code.
by raw_anon_1111
3/18/2026 at 12:02:56 AM
That’s like saying that to trace a spline, you only need to place a few points, carefully verify that the spline passes through those points, and never verify the actual formula of the spline.

Or in other words: tests only guarantee their own results, not the code. A test’s value comes from knowing the code is trying to solve the general problem, not just the test’s assertions.
by skydhash
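The spline analogy can be shown with a tiny Python sketch (both functions and the test points are invented for the example): two implementations agree at every asserted point yet differ everywhere else, so the point-based assertions alone prove nothing about the underlying formula.

```python
def intended(x):
    # The curve we actually meant: y = x^2.
    return x * x

def impostor(x):
    # Agrees with intended() only at x = 0, 1, 2, because the
    # added term vanishes exactly at those points.
    return x * x + x * (x - 1) * (x - 2)

# Point-based tests: both implementations pass every assertion...
for f in (intended, impostor):
    assert f(0) == 0
    assert f(1) == 1
    assert f(2) == 4

# ...yet they are different functions away from the test points.
assert intended(3) == 9
assert impostor(3) == 15   # 9 + 3*2*1
```

This is exactly the failure mode described upthread: a model can satisfy the assertions it was shown while implementing the wrong general behavior.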
3/18/2026 at 12:13:39 AM
That’s a horrible analogy. He specifically said he was designing and validating the tests based on his knowledge of the project’s goal.
by raw_anon_1111
3/18/2026 at 6:30:00 AM
Have you tried Claude 4.6 Opus? I think it might be able to do what you're suggesting.
by maplethorpe