5/15/2026 at 12:18:15 AM
I have generally moved from bearish to bullish on the future of current AI technology, but the persistent inaccuracy with basic facts, even as the models improve significantly, continues to give me pause.
As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right, but something is a bit off, and then it turns out they're a zombie about to eat your brain. This note-taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.
by rainsford
5/15/2026 at 3:39:36 AM
Yup, spot on. There's a capability-reliability gap that the industry does not like to talk about too much.
It often feels like the AI industry is continually glossing over the fact that capability and reliability are fundamentally different qualities. We tend to use "accurate" and "reliable" interchangeably, but they describe different things. A model can ace a benchmark (capability/accuracy) and still be a liability in production (reliability).
Just look at the recent reactions to yet another release from METR showing improved capabilities. The less-discussed part is that their headline measure is the task time-horizon at a 50% success rate (and their even-less-discussed secondary measure, at an 80% success rate, has a drastically shorter time-horizon). https://metr.org/
I implement AI systems for enterprises and I don't know any that would ever be okay with 80% reliability (let alone 50%).
by cootsnuck
5/15/2026 at 5:37:21 AM
This capability-reliability gap (excellent term btw, more people need to think in these terms or we'll be in real trouble) is also infecting LLM-assisted outputs. I just tried VSCode again tonight after a ~3yr hiatus and goddamn has it deteriorated. Lots of new features, lots of interesting-looking plugins, but 3 of the 5 plugins I tried for code CAD (the reason I downloaded VSCode again at all) were completely unusable--they couldn't be made to work at all--and the other two didn't do anything like what they claimed. VSCode itself also got into some kind of spastic loop trying to log me into GitHub, and seemed incapable of recognizing the virtual environment in a Python project's workspace... It also feels like the UI got even slower. This situation is bad.
by jcgrillo
5/15/2026 at 8:32:40 AM
Your analogy reminds me of the messed-up fingers and hands in image generation models just a year ago. Now that is pretty much solved. These days they are generating videos you can't tell apart from reality. This makes me believe these rough edges will keep shrinking and eventually become very hard to notice and find, in maybe every task.
by smusamashah
5/15/2026 at 8:25:04 AM
Yesterday I was using Opus 4.6 through Copilot (don't ask...) to rubber-duck-brainstorm a big feature that needs a lot of care.
I got some inspiration from it, but it misinterpreted very basic stuff. Might be a skill issue on my side, I do not know.
by igleria
5/15/2026 at 12:39:09 AM
> we're not actually on the right track to achieve real intelligence.
Real intelligence means you have to say "I don't know" when you don't know, or ask for help, or even just refuse to help (with the subtext being that you don't want to appear stupid).
The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult, or because it would harm the reputation of the companies charging a good sum to use them.
by themafia
5/15/2026 at 9:22:09 AM
> Real intelligence means you have to say "I don't know" when you don't know
I have met many supposedly intelligent, certainly high-status, humans who don't appear to be able to do that either.
I have more confidence we can train AIs to do it, honestly.
by vintagedave
5/15/2026 at 1:15:21 AM
That's just not how they work, really. They don't know what they don't know, and their process requires an output.
I think they're getting better at it, but that's likely just the parameter counts of the SOTA models getting bigger and bigger more than anything.
by cmrdporcupine
5/15/2026 at 1:52:54 AM
They do know what they don't know. There's a probability distribution over outputs that they are sampling from. That distribution just isn't being used for this purpose.
by adastra22
5/15/2026 at 5:12:44 AM
Common misconception. As far as we know, LLMs are not calibrated, i.e. their output "probabilities" are not in fact necessarily correlated with the actual error rates, so you can't use e.g. the softmax values to estimate confidence. That is why it is more accurate to talk about e.g. the model "logits", "softmax values", "simplex mapping", "pseudo-probabilities", or even more agnostically, just "output scores", unless you actually have strong evidence of calibration.
To get calibrated probabilities, you actually need to use calibration techniques, and it is extremely unclear if any frontier models are doing this (or even how calibration can be done effectively in fancy chain-of-thought + MoE models, and/or how to do it in RLVR- and RLHF-based training regimes). I suppose if you get into things like conformal prediction, you could ensure some calibration, but this is likely too computationally expensive and/or has other undesirable side effects.
EDIT: Oh, and there are also anomaly detection approaches, which attempt to identify when we are in outlier space using various (e.g. distance) metrics computed on the embeddings, but even getting actual probabilities out of these is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty: that information isn't generally actually "there" in the model, in any clean sense.
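For anyone wondering what "calibrated" means operationally, here is a minimal sketch of a calibration check (expected calibration error); the confidences and labels below are hypothetical, and a real check needs model confidences paired with ground-truth correctness:

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Bin predictions by stated confidence, then compare each bin's
        # average confidence against its empirical accuracy.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                ece += mask.mean() * gap
        return ece

    # A model that says "0.9" but is right only 60% of the time is
    # miscalibrated, however confident it sounds.
    print(expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0]))  # 0.3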
by D-Machine
5/15/2026 at 8:23:01 AM
I don't think it's that hard to get them to say "I don't know".
I'm pretty sure they are actively trained to avoid it.
Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
by plaguuuuuu
5/15/2026 at 8:59:57 AM
> I'm pretty sure they are actively trained to avoid it.
I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that have not yet actually been solved, e.g. the Collatz conjecture, end with it saying it doesn't know how to solve the problem.
Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and definitely didn't listen when it identified a close-up photo of some Solanum nigrum growing next to my tomatoes as also being tomatoes.
> Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.
by ben_w
5/15/2026 at 9:03:22 AM
It's not hard to get them to say "I don't know", and they will do so regularly. It's hard to get them to say "I don't know" reliably (i.e. to say it when they don't actually know, and to not say it when they do know). And in general, even statements or tasks they do 'know' (i.e. normally get right), they will occasionally get wrong.
by rcxdude
5/15/2026 at 5:22:40 AM
I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.
It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs.
by adastra22
5/15/2026 at 5:26:58 AM
> The question is about relative uncertainty, and the softmax values are just fine for that.
They really aren't, especially if you consider the chain-of-thought / recursive application case, and also that you can't even assume e.g. a difference of 0.1 in softmax values means the same relative difference from input to input, or that e.g. a 0.9 is always "extremely confident", etc. You really have no idea unless you are testing the calibration explicitly on calibration data.
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings, sure, but if you claim you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.
by D-Machine
5/15/2026 at 5:33:03 AM
Relative probabilities. That means comparing 2+ alternatives, and we're only talking about the model's worldview, not objective reality. The math for that is relatively straightforward. "Yes" on its own could be 0.9, and OK, that means nothing. But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities.]
And now I'm certain we're talking past each other. I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?", which is what I interpreted the question above to be about. You can get that out of an LLM, with some work.
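A rough sketch of the renormalization I mean, with hypothetical logit values (this says nothing about calibration; it's purely the model's internal relative weighting):

    import math

    def renormalized(logits):
        # Softmax restricted to the allowed alternatives only.
        m = max(logits.values())  # subtract max for numerical stability
        exps = {tok: math.exp(v - m) for tok, v in logits.items()}
        total = sum(exps.values())
        return {tok: e / total for tok, e in exps.items()}

    # Hypothetical final-layer logits for just the two permitted tokens.
    print(renormalized({"Yes": 2.1, "No": 1.2}))  # {'Yes': ~0.71, 'No': ~0.29}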
by adastra22
5/15/2026 at 5:40:10 AM
> But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities.]
There is nothing straightforward about this, and no, there is no such formula.
> I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?"
If all you care about is vibes / feels, sure. If you actually need numerical guarantees and quantitative estimates, so that your "feelings" about confidence mean something and can rigorously justify decisions, you need calibration. If you aren't talking about calibration in these discussions, you are missing probably the most core technical concept that addresses these issues seriously.
by D-Machine
5/15/2026 at 5:50:17 AM
We're talking about artificial intelligence. Making computers think the way people do. People are notoriously miscalibrated on their own self-assessed probabilities too.
Finding a way to objectively calibrate a sense of "how confident do I feel about this?" would be fantastic. But let's not move goalposts. It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
by adastra22
5/15/2026 at 5:55:24 AM
IMO it is you who are moving the goalposts, most likely in an attempt to hide the fact that you were unaware of calibration before this discussion.
> It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
If human feelings are badly calibrated, they are useless here too, so no, I don't agree. Things like "confidence" only matter if they are actually tied to real outcomes in a consistent way, and that means calibration.
by D-Machine
5/15/2026 at 2:20:36 AM
I'm not clear on what you mean by "know." If you mean "the information is in the model", then I mostly agree; distributional information is represented somewhere. But if you mean that a model can actually access this information in a meaningful and accurate way (say, to state its confidence level), I don't think that's true. There is a stochastic process sampling from those distributions, but can the process introspect? That would be a very surprising capability.
by raddan
5/15/2026 at 2:30:01 AM
Yes:
> In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally.
by kneyed
5/15/2026 at 3:35:15 AM
Having a probability distribution to sample from is not the same thing as knowing, because they don't know anything about the provenance of the data that was used to build the distribution. They trust their training set implicitly, by construction. They have no means to detect systematic errors in their training set.
by chongli
5/15/2026 at 5:27:28 AM
You are talking about something different. If I ask you a yes/no question, and then ask you how certain you are, the answer you give is not an objective measurement of how likely you are to be right. You don't have access to that either. If you say "I'm very confident" or "Maybe 50/50" -- that is an assessment of your own internal weighted evidence, which is the equivalent of an LLM's softmax distribution.
by adastra22
5/15/2026 at 2:45:18 AM
Well, with thinking models, it's not that simple. The probability distribution is over the next token. But if a model thinks before producing an answer, you can get a high-confidence next token even if MCMC sampling the model's thinking chain would reveal that the real probability distribution had low confidence.
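A sketch of the point under that framing: resample the whole thinking chain repeatedly and look at the spread of final answers, instead of trusting one chain's final-token confidence (sample_chain is a hypothetical stand-in for one stochastic chain-of-thought rollout):

    from collections import Counter

    def answer_distribution(sample_chain, prompt, n=20):
        # Empirical distribution over final answers across n rollouts.
        # One chain can end in a high-confidence token even when the
        # answers disagree badly across resampled chains.
        answers = Counter(sample_chain(prompt) for _ in range(n))
        return {ans: count / n for ans, count in answers.items()}

by dhampi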
5/15/2026 at 2:08:33 AM
Oh, you mean somewhere it is tracking the statistical likelihood of the output. Yeah, I buy that, although I think it just tends towards the most likely output given the context that it is dragging along. I mean, it wouldn't deliberately choose something really statistically unlikely; that's like a non sequitur.
by Isamu
5/15/2026 at 3:07:59 AM
Well, it's not tracking. As it predicts each token, it is sampling from a probability distribution -- that's what the matrix multiplies are for. It gets a distribution over all tokens and then picks randomly according to that distribution. How flat or how spiky that distribution is tells you how confident it is in its answer.
But it then throws that distribution away / consumes it in the next-token calculation. So it's not really tracking it per se.
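"Flat or spiky" can be made concrete as the entropy of that next-token distribution; a minimal sketch with hypothetical numbers:

    import math

    def entropy(probs):
        # Shannon entropy in bits: 0 = fully confident, higher = flatter.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    spiky = [0.97, 0.01, 0.01, 0.01]  # near-certain about the next token
    flat = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 tokens
    print(entropy(spiky), entropy(flat))  # ~0.24 bits vs. 2.0 bits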
by adastra22
5/15/2026 at 2:19:34 AM
From its point of view, what does it mean "to know"?
Is it the token (or set of tokens) that is strictly >50% probable, or is it just the highest probability in a set of probabilities?
While generating bullshit is not ideal for a lot of use cases, you don't want your premier chatbot to say "I don't know" to the general public half the time. The investment in these things requires wide adoption, so they are always going to favour the "guesses".
by tempest_
5/15/2026 at 1:43:38 AM
You can just tell the agent to do exactly that.
by wagwang
5/15/2026 at 6:28:59 AM
Except you can't be sure it isn't producing nonsense when you do this, and generally the model(s) will be overconfident. This has been studied, see e.g. https://openreview.net/pdf?id=E6LOh5vz5x
> An alternative way to obtain uncertainty estimates from LLMs is to prompt them directly. One benefit of this approach is that it requires no access to the internals of the model. However, this approach has produced mixed results: LLMs can sometimes verbalize calibrated confidence levels (Lin et al., 2022a; Tian et al., 2023), but can also be highly overconfident (Xiong et al., 2024). Interestingly, Xiong et al. (2024) found that LLMs typically state confidence values in the range of 80-100%, usually in multiples of 5, potentially in imitation of how humans discuss confidence levels. Nevertheless, prompting strategies remain an important tool for uncertainty quantification, along with measures based on the internal state (such as MSP).
by D-Machine
5/15/2026 at 2:14:56 AM
I've had various agents backed by various models ignore the shit out of various rules and requests, at varying rates, but they all do it.
When you point it out: "Oh yes, I did do that, which is contrary to the rules, request <whatever>... Anyway..."
by tempest_
5/15/2026 at 4:49:08 AM
If you are on a SOTA model, your context window is under 100k tokens, and you don't have any vague or contradictory rules, then I've almost never seen a rule broken.
The most common failures I've seen come from tools that pollute their context with crap, so the LLM forgets stuff or just gets confused by all the irrelevant sentences; which, if the report is true, is probably what these AI notetakers are guilty of. This problem gets exacerbated if these tools turn on the 1M-context-window version.
by wagwang
5/15/2026 at 2:08:11 AM
> You can just tell the agent to do exactly that.
You can.
It just won't do it.
by alterom
5/15/2026 at 4:45:58 AM
Seems to work for me:
https://chatgpt.com/share/6a06a4c5-d454-83e8-a5b2-c9468f6588...
by wagwang
5/15/2026 at 12:52:25 AM
My theory is that it's because the people building the models, and in charge of directing where they go, love the sycophantic yes-man behavior the models display.
They don't like hearing "I don't know".
by bluefirebrand
5/15/2026 at 12:56:00 AM
You can TELL the models to do this and they'll follow your prompt.
"Give me your answer and rate each part of it for certainty by percentage", or similar.
by colechristensen
5/15/2026 at 1:19:17 AM
Could you please tell me how it generates that certainty score?
by mylifeandtimes
5/15/2026 at 1:53:11 AM
Vibes.
by adastra22
5/15/2026 at 1:37:57 AM
The whole thing is a statistical model; that's just what it is. No, I cannot in a reasonable way dissect how an LLM works to a level satisfactory to a skeptic.
by colechristensen
5/15/2026 at 2:12:22 AM
He's not a skeptic; he's asking you to explicitly state your reasoning, with the expectation that either the readers will learn something or (more likely) you will realize that your thought and speech pattern there was the equivalent of an LLM hallucinating. Yes, you can prompt it as you suggested, and yes, you will generally receive a convincing answer, but it is not doing what you seem to think it is doing, i.e. the generated rating is complete bullshit that the model pulled out of its proverbial ass.
by fc417fc802
5/15/2026 at 3:48:39 AM
Are you actually curious, or do you just want to argue against it?
by colechristensen
5/15/2026 at 4:10:02 AM
I think you're obviously wrong (based on my relatively detailed, but certainly somewhat out-of-date and not expert-level, knowledge of LLM internals), but if you're willing to explain your reasoning, I'm willing to reconsider my own position in light of any new information or novel observations you might provide.
by fc417fc802
5/15/2026 at 5:24:18 AM
GP is obviously wrong, and probably doesn't know about calibration, and/or that it isn't even clear how to calibrate frontier models in the manner we need, given how complex and expensive the training is, and how tricky calibration becomes in e.g. mixture-of-experts and chain-of-thought approaches.
by D-Machine
5/15/2026 at 8:34:08 AM
I suspect that introducing the calibration concept might be a case of too much too soon for some people.
As far as I understand it, the various probability matrices boil down to: what token has the highest likelihood of coming next, given this set of input tokens? Which then all gets chucked away and rebuilt when the most likely token is appended to the input set (see the sketch below).
Objective assessment of internal state - again, to my non-expert eye - doesn't appear to have any way to surface.
Big if: if my rough working understanding is more or less correct, your calibration point makes a lot of sense to me. I'm not sure that it would make sense to someone who e.g. imagines some form of active thinking process that is intellectualising about whether to output this or that token.
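A minimal sketch of that loop as described (next_token_distribution is a hypothetical stand-in for the model's forward pass):

    def greedy_decode(model, tokens, max_new_tokens):
        # Score every token, keep the argmax, discard the distribution,
        # and repeat with the extended input.
        for _ in range(max_new_tokens):
            dist = model.next_token_distribution(tokens)  # probs over vocab
            best = max(dist, key=dist.get)                # most likely token
            tokens = tokens + [best]                      # dist is discarded
        return tokens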
by mootothemax
5/15/2026 at 3:55:37 AM
"I can only explain my beliefs to people who promise they'll agree" is certainly a unique take.by clipsy
5/15/2026 at 1:45:13 AM
It's a statistical model for words and sentences, not knowledge. What does the LLM know about having a pebble in your shoe, or drinking a nice cup of coffee?
by skydhash
5/15/2026 at 12:51:29 AM
I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old-style deterministic tools the same way, and for the same reasons, we do.
So instead of an LLM trying to answer a math or reasoning question by finding a statistical match with other similar groups of words it found on 4chan and the All-In podcast and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
by Brian_K_White
5/15/2026 at 3:31:40 AM
They absolutely need deterministic tools. What you just described is exactly how the current popular AI agents work. They use "harnesses", which to me is just a rebranding of what we have known all along about building useful and reliable software: composable, orchestrated systems with a variety of different pieces, selected based on their capabilities and constraints, glued together for specific outcomes.
It just feels like for some reason this is all being relearned with LLMs. I guess shortcuts have always been tempting. And the idea of a "digital panacea" is too hard to resist.
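A minimal sketch of the harness pattern under stated assumptions: call_model stands in for any LLM API, and the "CALC:" convention is invented here for illustration. The point is that the arithmetic happens in deterministic code, not in the model:

    import ast, operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calculator(expr):
        # Deterministic arithmetic: parse and evaluate, no guessing.
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    def run_turn(call_model, prompt):
        reply = call_model(prompt)        # model answers or requests a tool
        if reply.startswith("CALC:"):     # hypothetical tool-call convention
            result = calculator(reply[len("CALC:"):].strip())
            return call_model(prompt + "\nTool result: " + str(result))
        return reply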
by cootsnuck
5/15/2026 at 3:47:04 AM
Doesn't agentic AI do this? I've got AI running in VS Code. If I ask it for something, it can fill a code cell with a little bit of Python, and then run it with my approval. It's using the Python interpreter on my computer as a calculator.
by analog31
5/15/2026 at 2:09:34 AM
I think that is how the smarter agents do things? Just like Claude/ChatGPT sometimes does a web search, they can do other tool calls instead of just making a statistical guess. Of course, it doesn't always make the bright choice between those options though…
by stevula
5/15/2026 at 3:20:00 AM
They will also lie and produce output claiming it is based on tool execution without having actually used the tool.
Yes, another layer that cross-checks, say, "in kubectl logs I see …" against an actual k8s tool call can help; that is, when the cross-check layer doesn't lie itself.
For the time being, IMHO, human validation at key points is the only way to get good results. This is why these tools make experienced people potentially a lot more efficient (they are quick to spot errors/BS) and inexperienced people potentially more dangerous (they're more prone to trusting the responses, since the tone usually sounds very professional).
by fipar
5/15/2026 at 2:15:10 AM
> it doesn’t always make the bright choiceI'm available for a small fee.
by WalterBright
5/15/2026 at 3:11:58 AM
You must be living in absolute opulence :)
by sgc
5/15/2026 at 3:22:16 AM
That's exactly how all the current cloud chatbots and agents work now.
by epcoa
5/15/2026 at 12:54:08 AM
No, they just need to be trained to have adversarial self-review "thinking" processes.
You ask an LLM "What's wrong with your answer?" and you get pretty good results.
by colechristensen
5/15/2026 at 12:59:29 AM
Or the original output was perfect, and the adversarial "rethinking" switches it to an incorrect result.
5/15/2026 at 2:12:34 AM
This seems to happen far more than I would like.