5/20/2026 at 1:43:19 PM
The non-hallucination rate in AA-omniscience is SOTA, better than Opus 4.7, Gemini 3.1 Pro and GPT5.5! Congrats to the team
by goldenarm
5/20/2026 at 2:21:09 PM
referencing this: https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the dataset or no?)
by throawayonthe
5/20/2026 at 7:27:36 PM
This counts only incorrect answers though. A model can get 0% hallucination rate just by refusing to answer all questions.
by jampekka
5/20/2026 at 8:21:49 PM
Isn't that precisely the reason why we introduced the term hallucination? Because LLMs have historically always made up bullshit if they cannot answer directly... If they have now nailed this, to the point where the model may decline to respond instead of responding incorrectly, then a lot of previously unusable use cases would become feasible.
So I feel like that's exactly the right metric and the way to track it wrt hallucinations.
by ffsm8
5/20/2026 at 10:41:36 PM
I had a buddy in high school that was notorious for doing the same thing. (He's now a senior director at a Big 4 consultancy. :) )
by doublescoop
5/21/2026 at 6:09:58 AM
Do you mind expanding a little more?
by rrgok
5/22/2026 at 9:21:53 AM
They had a buddy who used to lie a lot when they were younger… now they get paid for it
by alfiedotwtf
5/21/2026 at 3:08:00 AM
The point is that it's not a useful metric on its own. For example, redirecting from /dev/null also achieves a zero hallucination rate.
We want the hallucination rate to decrease while the overall answer rate of queries remains sufficiently high. For more specifics, look into ROC and AUC.
by akoboldfrying
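To make the trade-off concrete, here is a minimal Python sketch of a threshold sweep in the spirit of an ROC curve (strictly it is closer to a risk-coverage curve). Everything in it is an assumption for illustration: the per-question confidence scores are made up, the names `Point` and `sweep` are hypothetical, and the hallucination rate is taken to be wrong answers divided by all non-correct responses, which is how the metric is described in this thread rather than a confirmed formula.

```python
from dataclasses import dataclass


@dataclass
class Point:
    threshold: float
    answer_rate: float         # fraction of questions the policy attempts
    accuracy: float            # correct answers / all questions
    hallucination_rate: float  # wrong answers / (wrong + abstained), assumed definition


def sweep(confidences, is_correct, thresholds):
    """Evaluate an abstain-below-threshold policy at each threshold."""
    n = len(confidences)
    points = []
    for t in thresholds:
        answered = [c >= t for c in confidences]
        correct = sum(a and ok for a, ok in zip(answered, is_correct))
        wrong = sum(a and not ok for a, ok in zip(answered, is_correct))
        abstained = n - correct - wrong
        non_correct = wrong + abstained
        # With no non-correct responses the rate is undefined; report 0.0 here.
        rate = wrong / non_correct if non_correct else 0.0
        points.append(Point(t, (correct + wrong) / n, correct / n, rate))
    return points


# Made-up data: six questions, each with a hypothetical confidence score.
confidences = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
is_correct = [True, True, True, False, True, False]

for point in sweep(confidences, is_correct, [0.0, 0.5, 1.01]):
    print(point)
```

At the highest threshold the policy answers nothing, so the hallucination rate is zero, but so are the answer rate and the accuracy. That is the /dev/null case: the rate only means something when read next to how much the model still answers.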
5/20/2026 at 10:36:29 PM
I think that's what the Omniscience Index is for: https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
It rewards correct answers, penalizes hallucinations, and gives no reward for refusing to answer.
It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
by jug
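A scoring rule of the kind described above can be sketched in a few lines of Python. The equal +1 / -1 / 0 weighting and the scaling to [-100, 100] are assumptions based only on the description in this comment; the benchmark's published formula may differ, and `omniscience_style_index` is a hypothetical name.

```python
from collections import Counter


def omniscience_style_index(outcomes):
    """Score in [-100, 100]: +1 per correct, -1 per incorrect, 0 per abstention.

    The weighting is an assumption for illustration, not the official formula.
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    return 100.0 * (counts["correct"] - counts["incorrect"]) / total


# Three made-up answer profiles over ten questions each.
always_abstain = ["abstain"] * 10
bold_guesser = ["correct"] * 6 + ["incorrect"] * 4
careful = ["correct"] * 5 + ["incorrect"] * 1 + ["abstain"] * 4

for name, outcomes in [
    ("always_abstain", always_abstain),
    ("bold_guesser", bold_guesser),
    ("careful", careful),
]:
    print(name, omniscience_style_index(outcomes))
```

Under this rule, refusing everything scores 0 rather than topping the chart, and the careful profile beats the bold guesser despite fewer correct answers, because each wrong answer cancels out a right one.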
5/20/2026 at 8:31:54 PM
Yes. A model that can answer "I don't know" would be much more trustable than the current used car salesman we have now.
by speed_spread
5/21/2026 at 10:27:52 AM
Models can answer "I don't know". Hallucination benchmarks, including this one, give the models the option to "not attempt". It's just that the metric linked doesn't take into account the rate of correct answers at all. It has its uses in analyzing incorrect vs not attempted answers, but gives a very partial picture.
by jampekka
5/20/2026 at 10:10:29 PM
It's very annoying, because this has been within the capability of models since the very beginning. A model could check how probable its token values are and, if those fall below a certain threshold, either say "I don't know" or output the most probable (well, more like least improbable) tokens but give a very clear, very strong warning that it is a shot in the dark and likely to contain hallucinations.
But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)
by jorvi
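The proposal above amounts to something like the following Python sketch. It assumes the caller already has per-token log-probabilities for a generated answer (many inference stacks can return these, but no particular library's API is implied), and the function name, the geometric-mean aggregation, and the 0.5 threshold are all hypothetical choices. The replies in this thread argue that token probability is a poor proxy for factual confidence, so treat this as an illustration of the idea, not a working hallucination detector.

```python
import math


def answer_or_abstain(tokens, logprobs, min_avg_prob=0.5):
    """Return (text, avg_prob); abstain when the geometric-mean token
    probability falls below min_avg_prob (a hypothetical threshold)."""
    if not tokens or len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must be non-empty and the same length")
    # exp(mean(log p)) is the geometric mean of the token probabilities.
    avg_prob = math.exp(sum(logprobs) / len(logprobs))
    if avg_prob < min_avg_prob:
        return "I don't know", avg_prob
    return "".join(tokens), avg_prob


# Made-up examples: one confident single-token answer, one shaky two-token answer.
confident = answer_or_abstain(["Paris"], [math.log(0.9)])
shaky = answer_or_abstain(["Ly", "on"], [math.log(0.3), math.log(0.2)])
print(confident)
print(shaky)
```

The second example abstains because its geometric-mean probability is about 0.24. Note what the check cannot see: a fluent, high-probability sentence that happens to be false passes straight through.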
5/21/2026 at 1:03:12 AM
It can't, because top n isn't always reliable.
Hallucination detection is an open problem. If it were that simple, people would indeed "just" do it.
Basically the problem is that LLMs aren't trained on things they don't know; an alternative way of saying this is that they're not trained on things they're not trained on, which is obviously true.
When you RL a model and it answers incorrectly, you don't teach it to answer "I don't know", you teach it to answer correctly instead. This makes it very hard for it to realize when it doesn't know things.
by miki123211
5/21/2026 at 2:59:40 AM
Models tend to default to their training data even when they lack sufficient context. They've never been trained to recognize their own uncertainty, so they hallucinate confidently instead.
by chengyongru
5/21/2026 at 12:24:25 AM
I don't have much to add other than this observation that we seem to have moved away from eating one small rock per day for nutritional value, and adding gasoline to spaghetti.
The glue on pizza reference brought back memories :)
by tokenscoper
5/22/2026 at 2:29:29 PM
The probability of tokens is unfortunately a poor proxy for confidence because it is entirely possible for "mixing glue" to appear in a sentence about making pizza depending on context. It might even be likely if the user has asked the model to lie.
by SR2Z
5/21/2026 at 12:18:13 AM
Yeah, I never understood why the top n statistics weren't included in the chat interfaces, to color the text!
by nomel
5/21/2026 at 1:12:35 AM
> by refusing to answer all questions.
Cool, precisely the thing other AI is too stupid to do when they don't have the necessary knowledge.
by aicantdeny
5/21/2026 at 6:01:20 AM
Yes, that's in fact precisely the desired behavior when a model doesn't know the answer.
by Balinares
5/20/2026 at 3:34:03 PM
> The non-hallucination rate in AA-omniscience is SOTA
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
by gslepak
5/20/2026 at 4:09:44 PM
Well, yes, garbage in garbage out. That's a given and not what's meant by "hallucination" in this context.
by rlt
5/20/2026 at 7:36:59 PM
The observation goes beyond garbage in garbage out. Mainly, we're always operating from some prior and limited understanding, so what may look like a hallucination could be closer to the truth than our current frameworks of understanding allow us to admit. The hermeneutic circle.
by tantaman
5/21/2026 at 3:41:33 AM
A properly designed benchmark won't use tests that leave room for ambiguous interpretation.
by root_axis
5/20/2026 at 7:56:38 PM
Interesting. I wonder if current LLMs can break out of human limitations and understand the world more correctly.
by Jacques2Marais
5/20/2026 at 5:45:44 PM
Here are some examples of the questions in the benchmark. If these are representative, they seem pretty cut and dry. https://artificialanalysis.ai/evaluations/omniscience#exampl...
by jcheng
5/20/2026 at 8:48:38 PM
Was there something about this specific model and submission that made you feel compelled to write this self-evident observation?
Or would you describe your methodology as more like picking a random sentence fragment as an input value then generating completions from your existing corpus without any post-input "learning" process related to the rest of the source material?
by areweai
5/20/2026 at 9:00:12 PM
[dead]
by anti-zionist
5/20/2026 at 3:01:08 PM
Truly incredible! Very impressed by their progress. I wonder how much they used their own chips for training.
by sheepscreek
5/20/2026 at 10:51:32 PM
The big question for me, having used a lot of these SOTA Chinese models, is: what is its token efficiency like?
Running Step 3.5 Flash locally for example, it's an amazingly capable model all things considered, but its token efficiency is so bad that it gets outperformed by most others on wall-clock time (even with my MTP support for it hacked into llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and only gets it from 20tk/s to 30tk/s on my Spark)
The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5, they use many more tokens to get to the same answers.
I'm really hoping that Qwen 3.7 is better in this regard, can't wait to try it out
(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
by girvo
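The wall-clock point is simple arithmetic: decode time is roughly output tokens divided by throughput. A minimal Python sketch, using the 20 tk/s and 30 tk/s figures from the comment above; the per-answer token counts are hypothetical numbers chosen only to illustrate the effect, and prompt processing time is ignored.

```python
def wall_clock_seconds(output_tokens, tokens_per_second):
    """Approximate decode time for one response, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical token counts for the same task (not measured values).
verbose_tokens = 6000  # a token-hungry model
concise_tokens = 2000  # a more token-efficient model

verbose_base = wall_clock_seconds(verbose_tokens, 20)  # without MTP
verbose_mtp = wall_clock_seconds(verbose_tokens, 30)   # with the MTP speedup
concise = wall_clock_seconds(concise_tokens, 20)       # no speedup at all

print(verbose_base, verbose_mtp, concise)
```

With these made-up counts, a 1.5x throughput gain still leaves the verbose model at twice the wall-clock time of a model that needs a third of the tokens.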
5/21/2026 at 12:42:49 AM
Yes it's a big thing that people are slowly becoming more aware of.
Nvidia models are even worse than Qwen! https://sql-benchmark.nicklothian.com/#token-efficiency-and-... (mouse over the cells for token counts and click for traces)
Gemma 4 is good for this, as AA notes:
> Gemma 4 31B is notably token efficient, using 39M output tokens to run the Intelligence Index vs 98M for Qwen3.5 27B (Reasoning). This is ~2.5x fewer output tokens for a model scoring 3 points lower. For context, the other models at the 42-point intelligence level also use significantly more tokens: MiniMax-M2.5 (56M), DeepSeek V3.2 (Reasoning, 61M), and GLM-4.7 (Reasoning, 167M)
https://artificialanalysis.ai/articles/gemma-4-everything-yo...
by nl
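The ratios behind the quoted figures are easy to check. This short Python sketch uses only the output-token counts from the quote above (in millions) and normalises them against Gemma 4 31B; the variable names are illustrative.

```python
# Output tokens (millions) to run the Intelligence Index, as quoted above.
output_tokens_m = {
    "Gemma 4 31B": 39,
    "Qwen3.5 27B (Reasoning)": 98,
    "MiniMax-M2.5": 56,
    "DeepSeek V3.2 (Reasoning)": 61,
    "GLM-4.7 (Reasoning)": 167,
}

baseline = output_tokens_m["Gemma 4 31B"]
# How many times more output tokens each model used than the baseline.
ratios = {name: tokens / baseline for name, tokens in output_tokens_m.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x")
```

The Qwen3.5 27B figure comes out at about 2.51x, consistent with the "~2.5x fewer output tokens" in the quote.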
5/21/2026 at 9:47:35 PM
Right, and all of my own evals back this up for Gemma 4...
...except it's notably worse at coding in an agent context, even with a harness set up to do exactly what Google says it should do (wrt. sending summarised thinking back and so on)
So despite it being far better token efficiency wise, it's just worse for what I need to use it for compared to DSv4 Flash or Qwen 3.6 27B
Such a shame, too.
by girvo
5/20/2026 at 3:23:26 PM
wonder at which level there's a capability state transition? 5%? 1%?
by baq