12/31/2025 at 2:25:59 PM
I'd be curious how well it passes 100th Coin's NES accuracy tests: https://github.com/100thCoin/AccuracyCoin
by worble
12/31/2025 at 2:37:13 PM
Indeed, that's what I kind of hinted at in https://news.ycombinator.com/item?id=46442195 and, coincidentally, https://news.ycombinator.com/item?id=46437688 briefly after. Namely: OK, one can "generate" a "solution" much more easily than before... but until we can somehow verify that it actually does what it says it does (and we know about hallucinations and have no reason to believe this has changed), testing itself, especially of well-known "problems", is more and more important.
That being said, it doesn't answer the "why" in the first place, an even more important question. At least it does help somewhat to compare with existing alternatives.
by utopiah
12/31/2025 at 3:34:17 PM
Isn’t this how all software development works? Folks commit code, it’s tested and reviewed, and then it's deployed.
Why would this be any different?
by garciasn
12/31/2025 at 3:53:05 PM
That's not how software development works. Folks think, they write code, they do their own localized evaluation and testing, then they commit, and then the rest of the (down|up)stream process begins.
LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.
by PaulDavisThe1st
12/31/2025 at 4:00:21 PM
They absolutely can do that if you give them the tools. Seeing Claude (I use it with opencode agents) run curl and playwright to verify and then fix its implementation was a real 'wow' moment for me.
by sally_glance
12/31/2025 at 5:34:56 PM
We have different experiences. Often I’ll see Claude et al. find creative ways to fulfill the task without satisfying my intent, e.g., changing the implementation plan I specifically asked for, changing tolerances or even tests, and frequently disabling tests.
by Q6T46nT668w6i3m
1/1/2026 at 4:47:33 AM
Yeah, I feel that. When it happens, your only way out is to write down a more extensive implementation plan first. For me that is the point where I start regretting having tried to implement something using AI. But admittedly, most of the time revising the implementation plan and running the agent again is still faster than I could have done it on my own (I try to make implementation tasks explicit in the form of a markdown file, which has worked pretty well so far).
by sally_glance
12/31/2025 at 8:55:40 PM
I see these “you had a different experience than me” comments around AI coding agents a lot and can concur; I’ll have a different experience with Copilot even from day to day. Sometimes it’s great, and other days it's so bad I give up on using it at all.
Makes me honestly wonder: will AGI just give us agents that get into bad moods and don't want to work for the day because they're tired or just don't feel like it?
by Fr0styMatt88
12/31/2025 at 10:57:55 PM
If part of the goal is to emulate a person's abilities, then surely that includes a person's ability to fuck things up.
by ssl-3
12/31/2025 at 7:48:05 PM
Are you a customer?
by DANmode
12/31/2025 at 10:12:13 PM
Don’t downvote just because you don’t like the question. It obviously adds to the discussion: paid and non-paid accounts are being conflated daily in threads like these!
They’re not the same tier of account!
Free users, especially ones deemed less interesting to learn from in the future, are given table scraps when the provider feels it’s necessary for load reasons.
by DANmode
12/31/2025 at 11:11:15 PM
Exactly. There's an impedance mismatch between those using the free/cheap tiers and those paying a premium, so the discussion gets squirrely because one side is talking about apples and the other oranges.
by nineteen999
1/1/2026 at 6:23:46 AM
Right. More specifically: one side is talking about apples, and the other is talking about mushy old apples that you sometimes need to wait 12 hours for.
by DANmode
1/1/2026 at 1:24:01 AM
All user accounts are also customers. Some are just paying with data and with contributions to metrics going up.
by baobun
1/1/2026 at 2:36:19 AM
That’s not how words work. All users are stakeholders.
They’re emphatically not considered customers.
We can disagree with that and create legal protections for those people, but that doesn’t make them customers to OpenAI, Anthropic, et al.
by DANmode
12/31/2025 at 4:03:15 PM
> LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step.
I'm not sure where this idea comes from. Just instruct it to write and run unit tests and document as it goes. All of the ones I've used will happily do so.
You still have to verify that the unit tests are valid, but that's still far less work than skipping them or writing the code/tests yourself.
by mapontosevenths
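The "verify the unit tests are valid" step above is worth making concrete. A hypothetical sketch (the `slugify` function, its tests, and the punctuation-stripping requirement are all illustrative, not from any project in this thread): a generated test can pass while only restating behavior the code already has, so the reviewer still has to check that the assertions encode the actual intent.

```python
# Hypothetical function an agent might produce.
def slugify(title):
    return title.lower().replace(" ", "-")

# A generated test like this passes, but it only pins down what the
# implementation already does. If the (assumed) spec also required
# stripping punctuation, nothing here would catch the omission --
# noticing that gap is what "verifying the test is valid" means.
def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

test_slugify_basic()
```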
12/31/2025 at 8:42:33 PM
I disagree that it's less work. It just rewrites tests carte blanche. I've seen it rewrite and rewrite tests to the point of undermining the original test intention. So now, instead of intentionally writing code and a new unit test, I need to intentionally go and review EVERY unit test it touched. Every. Time.
It also doesn't necessarily rewrite documentation as the implementation changes. I've seen documentation rot happen within the same coding session.
by butlike
12/31/2025 at 11:04:07 PM
I've seen it do that as well, especially Gemini 3 lately. I've started adding an instruction to my GEMINI.md, once I'm happy with the tests, telling it not to edit them but to still run them.
I solve the documentation issue the same way: by telling it when and what to update in the .md file.
by mapontosevenths
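A minimal sketch of what such a GEMINI.md instruction block might look like (the wording, and the `tests/` and `docs/api.md` paths, are hypothetical; the file is free-form, so any phrasing the model follows reliably works):

```markdown
## Test policy

- The test files under `tests/` are frozen. Do NOT edit, delete, or skip them.
- Always run the full test suite after each change and report failures verbatim.
- If a test fails, fix the implementation, never the test.

## Documentation policy

- After changing any public function, update `docs/api.md` in the same session.
```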
12/31/2025 at 4:35:57 PM
> actually verify that the code I just wrote does what I intended it to
That's what the author did when they ran it.
by jimmaswell
12/31/2025 at 4:33:04 PM
Claude Opus 4.5 will routinely test its own code before handing it off to you, even with zero instruction to do so.
by adventured
12/31/2025 at 7:06:33 PM
One commercial equivalent to the project I work on, called ProTools (a DAW), has a test "harness" that took 6 people more than a year to write and takes more than a week to execute.
Last month, I made a minor change to our own code and verified that it worked (it did!). Earlier this week, I was notified of an entirely different workflow that had been broken by the change I had made. The only sort of automated testing that would have detected this would have been similar in scope and scale to the ProTools test harness, and neither an individual human nor an LLM is going to run that.
Moreover, that workflow was entirely graphical, so unless Claude Opus 4.5 (or whatever today's flavor of vibe-coding LLM agent is) has access to a testing system that allows it to inject mouse events into a running instance of our application (hint: it does not), there's no way it could run an effective test for this sort of code change.
I have no doubt that Claude et al. can verify that their carefully defined module does the very limited task it is supposed to do, for cases where "carefully defined" and "very limited" are appropriate. If that's the only sort of coding you do, I am sorry for your loss.
by PaulDavisThe1st
12/31/2025 at 7:51:34 PM
> access to a testing system that allows it to inject mouse events into a running instance of our application
FWIW, that's precisely what https://pptr.dev is all about. To your broader point, though, designing a good harness itself remains very challenging and requires actually understanding what has value for the user, the software architecture (to e.g. bypass user interaction and test the API first), etc.
by utopiah
12/31/2025 at 10:36:08 PM
> Puppeteer is a JavaScript library which provides a high-level API to control Chrome or Firefox
My world is native desktop applications, not in-browser stuff.
by PaulDavisThe1st
12/31/2025 at 11:13:18 PM
You suggest a web testing framework as a response to someone working on a real desktop app?
by nineteen999
1/1/2026 at 7:05:19 AM
No, I was sharing an example of a framework that does include "a testing system that allows it to inject mouse events".
That being said, injecting mouse events and the like isn't hard: e.g. start with a fixed resolution (using xrandr), then use xdotool or similar. Ideally, if the application has accessibility features, it won't be as finicky.
My point, though, was just to show that testing a GUI is not infeasible.
Apparently there is even a "UI Testing for devs & agents" product, https://www.chromatic.com, which I found via Visual TDD https://www.chromatic.com/blog/visual-test-driven-developmen... I can't recommend it, but it does show that even though the person I was replying to can't use Puppeteer in their context, the tooling does exist and the principles would still apply.
by utopiah
1/1/2026 at 4:42:19 PM
> My point though was just to show that testing with GUI is not infeasible.
Indeed, which is why I mentioned the ProTools test harness and the fact that it took 6 people a year to write and takes a week to run (or took a week, at some point in the past; it might be more or less now).
by PaulDavisThe1st
12/31/2025 at 11:44:53 PM
Claude can do that, yes: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
Although if you want to test a UI app, it's better to do it through accessibility APIs rather than actually looking at the screen and clicking.
by astrange
12/31/2025 at 3:05:50 PM
I’m sure you can point Claude at that page and have it make the necessary changes to pass.
by roger_
12/31/2025 at 3:46:03 PM
Or it could loop infinitely, never quite being able to pass all the tests.
by deadbabe
12/31/2025 at 9:50:33 PM
which is easily fixable by some human guidance
by hu3
1/1/2026 at 12:18:21 AM
Sorta. I went into this not really knowing how to implement an emulator: https://github.com/RAMJAC-digital/RAMBO
With the NES there are all sorts of weird edge cases, one of which is NMI flags and resets; the PPU in general is kinda tricky to get right. Claude has had *massive* issues with this, and I've had to take control and completely throw out code it generated. I'm restarting it with a clean slate though, as there are still issues with some of the underlying abstractions. The PPU is still the bane of my existence, as is DMA; I don't like the instruction pipeline, and I haven't even gotten to the APU. It's getting an 80/130 on AccuracyCoin.
Though, when it came to creating a WASM target, Claude was largely able to do it with minimal input on my end. Actually, getting the WASM emulator running in the browser was the least painful part of this project.
You will run into three problems: 1) "The Wall": when any project becomes large enough, you need the context window to be *very* specific and scoped, with explicit details of what is expected, the success criteria, and the deliverables. 2) Ambiguity means Claude is going to choose the path of least resistance, and will pedantically avoid/add things which are not specced. Stubs for functions, "beyond scope", and "deferred" are some favorite excuses for not refactoring or implementing obvious fixes (anything that would go beyond the context window is work Claude knows about, but won't tell you, will be punted). 3) Chatbots *loooove* to talk; it will vomit code for days. Removing code/documentation is anathema to Claude, with "backward compatibility", "deprecated", and "legacy" being its favorite excuses.
by RAMJAC
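The NMI edge cases mentioned above illustrate why the PPU is hard to get right. A deliberately simplified Python sketch of one such interaction, the vblank flag in PPUSTATUS ($2002) versus NMI generation (class and method names are illustrative, not from the RAMBO codebase; real hardware has further cycle-level races this model ignores):

```python
class PpuNmiSketch:
    """Simplified model of the NES PPU vblank-flag / NMI interaction."""

    def __init__(self):
        self.vblank_flag = False   # bit 7 of PPUSTATUS ($2002)
        self.nmi_enabled = False   # bit 7 of PPUCTRL ($2000)
        self.nmi_pending = False

    def start_vblank(self):
        # The flag is set at the start of vblank; the NMI is asserted
        # only if it has been enabled via PPUCTRL.
        self.vblank_flag = True
        if self.nmi_enabled:
            self.nmi_pending = True

    def read_ppustatus(self):
        # Reading $2002 returns the flag in bit 7 and then clears it,
        # so a poorly timed read can make the CPU "miss" a vblank.
        value = 0x80 if self.vblank_flag else 0x00
        self.vblank_flag = False
        return value
```

Accuracy suites like AccuracyCoin probe exactly these orderings (e.g. whether a $2002 read landing on the same cycle the flag is set suppresses that frame's NMI), which a scanline-free model this simple cannot capture.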
1/1/2026 at 3:51:53 PM
This sounds exhausting; once the thrill of seeing code rapidly generated wears off, I wonder if it's even worth it. If someone was going to use code they didn't write, why not just pull down some open source implementation from somewhere and build on top of it? It basically gets you the same thing, but without the LLM hassles, and you can start building on a saner foundation.
by deadbabe