4/17/2026 at 3:13:12 AM
Beware. I had Claude Code with Opus building boards and running SPICE simulations. It completely hallucinated the capabilities of the board and made some pretty crazy claims, like I had just stumbled onto the secret billion-dollar hardware project that every home needed.
None of the boards worked and I had to just do the project in Codex. Opus seemed too busy congratulating itself to realize it had produced gibberish.
by iterateoften
4/17/2026 at 12:09:01 PM
This matches what I've seen too: the hallucination gets much worse when the loop has no external verifier. "Does this board work?" has no ground truth inside the model, so it defaults to optimistic narration.
What OP is doing here is actually the mitigation: SPICE + scope readout is a verifier the model can't talk its way past. The netlist either simulates or it doesn't, the waveform either matches or it doesn't. That closes the feedback loop the same way tests close it for code.
The failure mode that remains, in my experience, is a layer down: when the verifier itself errors out (SPICE convergence failure, missing model card, wrong .include path), the agent burns turns "reasoning" about environment errors it has seen a hundred times.That's where most of the token budget actually goes, not the design work.
by ZihangZ
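For anyone who wants the "waveform either matches or it doesn't" check spelled out, here is a minimal Python sketch. It assumes the simulated and measured waveforms have already been resampled onto the same time points; the function name `waveform_matches` and the tolerance defaults are hypothetical, not taken from any tool mentioned in the thread.

```python
def waveform_matches(simulated, measured, rel_tol=0.05, abs_tol=0.01):
    """Return True if every measured sample is within tolerance of the simulation.

    Assumes both sequences are sampled at the same time points (hypothetical
    setup); a length mismatch is treated as a failed check.
    """
    if len(simulated) != len(measured):
        return False
    return all(
        # abs_tol keeps the check meaningful for samples near zero volts
        abs(s - m) <= max(abs_tol, rel_tol * abs(s))
        for s, m in zip(simulated, measured)
    )


# Made-up sample values, for illustration only
spice_trace = [0.0, 1.0, 2.0, 3.0]
scope_trace = [0.005, 1.02, 1.95, 3.1]
print(waveform_matches(spice_trace, scope_trace))
```

The point is that the result is a plain boolean computed outside the model, so there is nothing for the agent to narrate around.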
4/17/2026 at 12:12:54 PM
What throws me about this comment is the missing space between the period and the T in the last sentence.
Did the model itself do that? Was it a paste error?
by jddj
4/17/2026 at 3:33:27 PM
I’ve also noticed Gemini and Claude occasionally mixing up words recently (e.g. revel vs. reveal) and can’t decide whether it is due to cost optimization effects or some attempt to seem more human.
I can’t recall either of them using a wrong word for some time prior to this month.
by svnt
4/17/2026 at 3:47:17 PM
Or just because mistakes are part of the distribution that it's trained on? Usually the averaging effect of LLMs and top-k selection provides some pressure against this, but occasionally some mistake like this might rise up in probability just enough to make the cutoff and get hit by chance.
I wouldn't really ascribe it to any "attempt to seem more human" when "nondeterministic machine trained on lots of dirty data" is right there.
by lambda
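A toy Python sketch of the "make the cutoff and get hit by chance" mechanism described above. The token probabilities are invented for illustration and do not come from any real model; `top_k_sample` is a hypothetical helper, not a library function.

```python
import random


def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens, then sample one in proportion to its probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices normalizes the weights itself
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented next-token distribution: "revel" is the rare mistake
probs = {"reveal": 0.90, "show": 0.06, "revel": 0.03, "unveil": 0.01}
rng = random.Random(0)

samples = [top_k_sample(probs, k=3, rng=rng) for _ in range(10_000)]
print("share of 'revel' with k=3:", samples.count("revel") / len(samples))
```

With k=3 the wrong word survives the cutoff and is picked a small fraction of the time; with k=2 it would never appear, so a small shift in its probability or in the sampling settings is enough to change how often it shows up.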
4/17/2026 at 3:56:46 PM
Sure, but if that were the case why has it gotten worse recently? I would expect it to be as a result of cost optimization or tradeoffs in the model. I suppose it could be an indicator of the exhaustion of high quality training data or model architecture limitation. But this specific example, revel vs reveal, is almost like going back to GPT-2 reddit errors.
I also don’t want to pretend there is no incentive for AI to seem more human by including the occasional easily recognized error.
by svnt
4/17/2026 at 4:31:09 PM
Or just the models are getting bigger and better at representing the long tail of the distribution. Previously errors like this would get averaged away more often; now they are capable of modelling more variation, and so are picking up on more of these kinds of errors.
by lambda
4/17/2026 at 7:33:35 PM
That makes sense, but what is the solution?
by svnt
4/17/2026 at 4:35:38 PM
Looking at the account's other comment, there are subtle grammatical errors in that one too.
Would be good to see the prompt, out of morbid curiosity.
by jddj
4/17/2026 at 3:25:56 AM
I haven't tried it with Codex yet. But my approach is currently a little bit different. I draw the circuit myself, which I am usually faster at than describing the circuit in plain English. And then I give Claude the SPICE netlist as my prompt. The biggest help for me is that I (and Claude) can very quickly verify that my SPICE model and my hardware are doing the same thing. And for embedded programming, Claude automatically gets feedback from the scope and can correct itself. I do want to try out other models. But it is true, Claude does like to congratulate itself ;)
by _fizz_buzz_
4/17/2026 at 10:48:12 AM
It's because you are holding it wrong!
--courtesy for all the LLM pushers so they don't have to bother commenting on this one
by ezst
4/17/2026 at 1:10:36 PM
This week I tried to use Opus to analyse output from an oscilloscope and it was impossible to complete, because the Python scripts (which Opus wrote itself) were flagged as a cyber security risk. Baffling.
by varispeed