1/31/2026 at 12:10:57 PM
> If you ask AI to write a document for you, you might get 80% of the deep quality you’d get if you wrote it yourself for 5% of the effort. But, now you’ve also only done 5% of the thinking.
This, but also for code. I just don't trust new code, especially generated code; I need time to sit with it. I can't make the "if it passes all the tests" crowd understand and I don't even want to. There are things you think of to worry about and test for as you spend time with a system. If I'm going to ship it and support it, it will take as long as it will take.
by kranner
1/31/2026 at 1:35:12 PM
Yep, this is the big sticking point. Reviewing code properly is and was the bottleneck. However, with humans I trusted, I could ignore most of their work and focus on where they knew they needed a review. That kind of trust is worth a lot of money and lets you move really fast.
> I need time to sit with it
Everyone knows doing the work yourself is faster than reviewing somebody else's if you don't trust them. I'd argue that if AI ever gets to the point where you fully trust it, all white-collar jobs are gone.
by jdjdjssh
1/31/2026 at 1:00:08 PM
Yes, regression tests are not enough. One generally has to think through code repeatedly, with different aspects in mind, to convince oneself that it is correct under all circumstances. Tests only point-check; they don’t ensure correct behavior under all conceivable scenarios.
by layer8
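A toy sketch of the point-check problem (the function, tests, and use of the hypothesis library are illustrative, not from the thread): both example tests pass, yet the function is wrong for most of its input space.

```python
import calendar
from hypothesis import given, strategies as st

def days_in_month(month: int) -> int:
    # Buggy "works on the examples I tried" implementation.
    return 31 if month != 2 else 28

def test_january():
    assert days_in_month(1) == 31   # passes

def test_february():
    assert days_in_month(2) == 28   # passes (ignoring leap years)

# Widening the checked slice of the input space exposes the bug.
@given(st.integers(min_value=1, max_value=12))
def test_matches_calendar(month):
    # Fails for April, June, September, and November (30-day months).
    assert days_in_month(month) == calendar.monthrange(2025, month)[1]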
1/31/2026 at 4:21:31 PM
Unless you are in the business of writing flight control software, OS kernels, or critical financial software, I don't think your own code will reach the standards you mention. The only way we get "correct under all conceivable scenarios" software is to have a large team with long time horizons and large funding working on a small piece of software. It is beyond an individual to reach that standard for anything beyond code at the function level.
by doug_durham
2/1/2026 at 10:33:26 AM
I personally find that it's easier to iterate on 'something' and find what's wrong with it and fix it than to theorycraft while staring at a blank page.
by torginus
2/1/2026 at 12:20:10 PM
Me too! Sometimes I find it easier to ask an LLM to generate something and then discard the whole thing, muttering under my breath "no no no, that's not how you do this", and write my own code. It would be nice if editors would annotate generated code clearly.
by kranner
2/1/2026 at 7:07:05 AM
I'll borrow your phrasing lest another coworker decide to play therapist. Previously: "I'm not slinging drugs here, I care for myself and the users."by bravetraveler
1/31/2026 at 1:27:53 PM
Honest question: why is this not enough?
If the code passes tests, and also works at the functionality level, what difference does it make whether you've read the code or not?
You could come up with pathological cases, like: it passed the tests by deleting them, or the code it wrote is extremely messy.
But we know that LLMs are way smarter than this. There's a very, very low chance of this happening, and even if it does, a quick glance at the code can fix it.
by simianwords
1/31/2026 at 1:50:59 PM
You can't test everything. The input space may be infinite. The app may feel janky. You can't even be sure you're testing all that can be tested.
The code may seem to work functionally on day 1. Will it continue to seem to work on day 30? Most often it doesn't.
And in my experience, the chances of LLMs fucking up are hardly very, very low. Maybe it's a skill issue on my part, but it's also the case that the spec is sometimes discovered as the app is being built. I'm sure this is not the case if you're essentially summoning up code that exists in the training set, even if the LLM has to port it from another language, and they can be useful in parts here and there. But turning the controls over to the infinite monkey machine has not worked out for me so far.
by kranner
1/31/2026 at 3:22:51 PM
If you care about performance, test it (stress test).
If you care about security, test it (red teaming).
If you care about maintainability, test it (advanced code analysis).
Your eyeballs are super fallible; this is why bad engineers exist. Get rigorous.
by CuriouslyC
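A minimal sketch of what the "test it" stance might look like for the performance case (the handler, sizes, and latency budget are made up for illustration, not anyone's real numbers):

```python
import time

def handle_request(payload: list[int]) -> int:
    return sum(sorted(payload))          # stand-in for the code under test

def stress_test(max_size: int = 1_000_000, budget_s: float = 0.5):
    # Grow the input and fail loudly if latency blows the budget.
    for size in (1_000, 10_000, 100_000, max_size):
        payload = list(range(size))
        start = time.perf_counter()
        handle_request(payload)
        elapsed = time.perf_counter() - start
        assert elapsed < budget_s, f"size={size} took {elapsed:.3f}s"
        print(f"size={size:>9}  {elapsed:.4f}s")

if __name__ == "__main__":
    stress_test()
```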
2/1/2026 at 7:13:42 AM
They are so rigorous that they want to know the product and test [effectively]. Catch up: testing is also super fallible. Tests will fail, either directly or in spirit.
I'll leave with a teaser: are you testing what you think you are? Is it relevant? What do you do after; buy more tokens? Hope it's worth it, enjoy the slot machine. I find it a little loud.
by bravetraveler
1/31/2026 at 3:41:40 PM
Maybe that works for industrialised production of completely well-defined software. I don't think it leads to any kind of creative output.
by kranner
2/1/2026 at 12:45:37 AM
I feel this totally ignored the point about the infinite input space. You're only providing three scenarios, and the "your eyeballs are fallible, get rigorous" comment is either hilariously patronizing or ironically self-aggrandizing.
by memonkey
1/31/2026 at 4:30:56 PM
Good question. Several reasons.
1. Since the same AI writes both the code and the unit tests, it stands to reason that both could be influenced by the same hallucinations.
2. Having a dev on call reduces time to restore service because the dev is familiar with the code. If developers stop reviewing code, they won't be familiar with it and won't be as effective. I am currently unaware of any viable agentic AI substitute for a dev on call capability.
3. There may be legal or compliance standards regarding due diligence which won't get met if developers are no longer familiar with the code.
I have blogged about this recently at https://www.exploravention.com/blogs/soft_arch_agentic_ai/
by gengstrand
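A toy illustration of point 1, where the test merely restates the implementation's assumption rather than the spec (the function name and the "discount" rule are invented for the example):

```python
def apply_discount(price: float, percent: float) -> float:
    # Shared misreading: treats "percent" as a fraction, not a percentage.
    return price * (1 - percent)

def test_apply_discount():
    # The generated test encodes the very same misreading, so it passes
    # while telling you nothing about whether the spec is satisfied.
    assert apply_discount(100.0, 0.10) == 90.0

# A reviewer who asks "what does a 10% discount look like per the spec?"
# would try apply_discount(100.0, 10) and immediately see a negative price.
```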
1/31/2026 at 1:37:45 PM
> If the code passes tests, and also works at the functionality level
Why doesn’t outsourcing work if this is all that is needed?
by jdjdjssh
1/31/2026 at 1:50:38 PM
We haven’t fully proven that it is any different. Not at scale anyway. It took a decade for the seams of outsourcing to break.
But I have a hypothesis.
The quality of the output, when you don’t own the long term outcome or maintenance, is very poor.
This is not the case with AI in the same sense it is with human contractors.
by jmathai
1/31/2026 at 1:50:31 PM
Why do we have managers if managers don’t have accountability?
by simianwords
1/31/2026 at 2:20:52 PM
I’m not sure what you’re getting at. I’m saying there’s a lot more to creating useful software than “tests pass / limited functionality checks work” from a purely technical perspective.
by jdjdjssh
1/31/2026 at 1:38:37 PM
It depends on the scale of complexity you’re working at and who your users are going to be. I’ve found that it’s trivial to have Claude Code spit out so much functionality that even just properly verifying it manually becomes a gargantuan task. I end up just manually testing the pieces I’m familiar with, which is fine if there’s a QA department who can do a full run-through of the feature and is prepared to deal with vibe-coding pitfalls, but not so much on open-source projects where slop gets shipped and unfamiliar users get stuck with bugs they can’t possibly troubleshoot. Writing the code from scratch The Old Way™ leaves a lot less room for shipping convincing but non-functional slop, because the dev has to work through it before shipping.
The most immediate example I can think of is the beans LLM workflow tracker. It’s insane that it’s measured in the hundreds of thousands of LoC, and getting that thing set up in a repo is a mess. I had to use GitHub Copilot to investigate the repo to get the latest method. This wouldn’t fly at my employer, but a lot of projects are going to be a lot less scrupulous.
You can see the effects in popular consumer-facing apps too: Anthropic has drunk way too much of its own Kool-Aid, and now I get 10-50% failure rates on messages in their iOS app depending on the day. Some of their devs have publicly said that Claude writes 100% of their code, and it’s starting to show. Intermittent network failures and retries have been a solved problem for decades, ffs!
by throwup238
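For reference, a rough sketch of the decades-old remedy being alluded to: retry transient failures with exponential backoff and jitter (the function name, attempt count, and exception types are illustrative):

```python
import random
import time

def with_retries(send, attempts: int = 5, base_delay: float = 0.5):
    """Call send(); on a transient failure, back off and try again."""
    for attempt in range(attempts):
        try:
            return send()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            # Exponential backoff plus a little jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```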
1/31/2026 at 12:41:39 PM
I think what LLMs do with words is similar to what artists do with software like Cinema 4D.
We have control points (prompts + context) and we ask LLMs to draw a 3D surface which passes through those points satisfying some given constraints. Subsequent chats are like edit operations.
by slfreference
1/31/2026 at 1:07:33 PM
An LLM is an impressive, yet still imperfect and unpredictable translation machine. The code it outputs can only be as good as your prompt is precise, minus the often blatant mistakes it makes.
by catdog
1/31/2026 at 3:20:07 PM
You're countering vibes with vibes.
If the tests aren't good enough, break them. Red team your own software. Exploit your systems. "Sitting with the code" is some Henry David Thoreau bullshit, because it provides exactly 0 value to anyone else, whereas red-teamed exploits are objective.
by CuriouslyC
1/31/2026 at 3:36:30 PM
The way you come up with ideas on how to break, red team, and exploit; when to do this and how to stop: that part is not objective. The machine can't do this for you sufficiently well. There is a subjective process in there that you're not acknowledging.
It's a good approach! It's just more 'negative space' than direct.
by kranner
1/31/2026 at 4:10:45 PM
People who pentest spend more time running a playbook than puzzling over the logical problem of how to break a piece of software. Even a lot of zero-days are more about knowing a pattern and mass scanning for it across a lot of code than about playing chess against a codebase and winning.
by CuriouslyC
1/31/2026 at 4:14:47 PM
Fine, but is that the entirety of software development? It even seems a waste of time, by your own reasoning, if it's so automatable already.
1/31/2026 at 4:55:49 PM
You're over-rotating on security. Not that it isn't important, but there are other dimensions to software that benefit heavily from the author having a deep understanding of the code that's being created.
by nkohari