How outdated information hides in LLM token generation probabilities

1/10/2025 at 8:34:02 AM

> The scenario that I’m worried about, and that is playing out right now, is that they get good enough that we (or our leaders) become overconfident in their abilities and start integrating them into applications that they just aren’t ready for without a proper understanding of their limitations.

Very true.

by 0xKelsey

1/12/2025 at 7:12:44 PM

I would say this is true until we hit a "oh shit, they can really sue us, even if we warn them" moment. I'm imagining legislation will come into play that will make it less ideal for business critical solutions. I know Air Canada has been sued and probably others.

I still think we are in the honeymoon phase, and once that is over, LLM will become what it is meant to be, which is a power tool for domain experts.

by sdesol

1/12/2025 at 2:39:52 PM

This is going to happen like in many engineering industries when a cheaper, more likely to be faulty part has been used in replacement of a more expensive, sturdier part. And people will groan but be unable to stop it, unfortunately

by aprilthird2021

1/12/2025 at 3:17:16 PM

It is happening in insurance right now and it is an unmitigated disaster that nobody wants to address.

There is real data used for insurance premiums and claims payouts but it's being swapped out for AI slop, and the sales folks are getting bonuses for selling hot garbage and the executives are getting bonuses for buying hot garbage.

by ihsw

1/12/2025 at 5:42:10 PM

Huh, if sales uses AI/LLMs and succeeds with it, as it seemed to me youve implied with their bonuses ... Isn't that actually positive for the company?

Or do you mean they succeed by promising lies via AI?

by ffsm8

1/13/2025 at 9:22:21 AM

> succeeds with it

I'd worry about things which:

1. "Succeed" in the short-term, but sets the company up for long-term failure.

2. Outperforms the competition with a pattern of activity which is actually illegal.

by Terr_

1/12/2025 at 10:26:19 PM

The industry is insurance. If you offload your due diligence to an AI that's wrong, the company will go underwater. What if the LLM tells you it's found a great market with little competition, high net worth individuals, etc. etc. and you don't check and end up making all your sales in the hills of California in wildfire country?

by aprilthird2021

1/12/2025 at 9:35:13 PM

Not OP but I could see this being short term savings related to the cost of sourcing/generating risk data leading to bonuses prior to the deficiencies in that risk model being exposed in claims long term.

by narutosasuke

1/12/2025 at 9:46:46 PM

[dead]

by ihsw

1/12/2025 at 8:17:36 AM

The o1 example is interesting. In the CoT summary it acknowledges that the most recent official information is 1611m, but it then chooses to say 1622 because it's more commonly cited. It's like it over-thinks itself into the wrong answer.

by ascorbic

1/12/2025 at 9:26:08 AM

Does it search the internet for that? I assume so because else claiming how often something is cited does not make sense, but would be interesting to know surely. Even gpt4o mini with kagi gets it right with search enabled (and wrong without search enabled - tried over a few times to make sure).

by freehorse

1/12/2025 at 10:05:10 AM

I don’t think the public o1 can search the internet yet, unlike 4o. In principle it could know that something is more commonly cited based on its training data. But it could also just be hallucinating.

by sd9

1/12/2025 at 12:54:44 PM

> it could know that something is more commonly cited based on its training data

No there is no such concept or way to do something like that. LLMs do not have such kind of meta-knowledge over their training data or weights. But there could be explicit mentions about this on their training data and they could pick on that and that is probably the simplest explanation.

by freehorse

1/12/2025 at 4:12:17 PM

>LLMs do not have such kind of meta-knowledge over their training data or weights.

Not sure this is a claim that can be confidently made.

https://arxiv.org/abs/2309.00667

https://x.com/flowersslop/status/1873115669568311727?t=eBMbK...

by og_kalu

1/12/2025 at 11:40:34 AM

> In principle it could know that something is more commonly cited based on its training data

Could it? Without explicit training for that, how would it be expected to know it has to be able to count occurrences of something?

by diggan

1/12/2025 at 12:29:07 PM

I think it would be more vibes based - commonly occurring things would be reinforced more in the weights. Rather than it explicitly counting the number of occurrences.

by sd9

1/12/2025 at 2:03:05 PM

So the probabilities would be skewed towards something, but unless the model could somehow count/infer its own weights, I don't see how it could "introspect" to see if something is more common than something else.

by diggan

1/12/2025 at 9:31:46 AM

Can the claim about citation frequency be just an answer pattern and not model's exact reasoning?

by asl2D

1/12/2025 at 9:42:22 AM

Yeah there could be parts of the training set with 1611 being explicitly called the official and 1622 being explicitly called the most common answer. But it can also have access to search results directly I think. Is there a way to know if it does or not?

by freehorse

1/12/2025 at 2:08:23 PM

I think i had similar case yesterday for Python script. It gave me code for older version of a module, but when i pasted the error i got, it corrected itself and gave me proper solution for version i had installed.

by patrulek

1/12/2025 at 10:14:35 AM

How could a language model infer that the official information overrules anything else?

by blueflow

1/12/2025 at 10:23:16 AM

Same way as we can: learning which sources are more trustworthy.

There's limits to how far you can go with this — not only do humans make mistakes with this, but even in the abstract theoretical it can never be perfect: https://en.wikipedia.org/wiki/Münchhausen_trilemma — but it is still the "how".

by ben_w

1/12/2025 at 11:07:50 AM

for the last 25+ years we rather not learned, but trusted the top3 of SERPs. Every ranking algorithm will be gamed eventually

by fullstackwife

1/12/2025 at 4:23:22 PM

I would say that we learned to trust the search engines; but otherwise I agree with you: every ranking algorithm will be gamed eventually.

(I wonder if giving an LLM content with intent to cause its users to spend money they didn't need to, would count as fraud, hacking, both, something else entirely?)

by ben_w

1/12/2025 at 10:19:06 AM

I’m not sure what kind of response you’re looking for, or if this is a rhetorical question or not. But “how could a language model infer…?” can be asked about a whole lot of things that language models have no problem reliably inferring.

by mistercow

1/12/2025 at 10:26:34 AM

> that language models have no problem reliably inferring

... the article did give me a different impression.

by blueflow

1/12/2025 at 10:29:19 AM

I don’t think you read my comment correctly.

by mistercow

1/12/2025 at 1:15:45 PM

Attention models learn what to pay attention to.

It's been found that data that begin with "Wikipedia:" are automatically weighted higher by language models during training, completely unsupervised.

by HeatrayEnjoyer

1/12/2025 at 1:36:01 PM

But this is the same problem - Wikipedia is a secondary source and should always get overruled by the primary source.

by blueflow

1/12/2025 at 4:47:24 PM

And there are documented cases of concentrated efforts to manipulate the content of Wikipedia.

by dotancohen

1/12/2025 at 6:24:16 PM

You’re probably being facetious but just in case:

https://www.bbc.com/news/technology-28481876.amp

https://www.bbc.com/news/technology-58559412.amp

(With apologies for amp links)

by brookst

1/12/2025 at 10:12:20 PM

I was most certainly not being facetious.

by dotancohen

1/12/2025 at 11:08:56 PM

It's still better than the information on most of the internet, which is what most of the dataset is.

by HeatrayEnjoyer

1/13/2025 at 12:07:30 AM

The bank account anchoring bias may be just the base model predicting as best as it can, as OP described earlier. Imagine a random Internet page where you read "There is V in my X. The Y of Z A, in Bs, is _." What do you expect? V, of course! Because why would a person on the Internet ever join these two factoids if they were totally unrelated (as they are in fact in this example)? That would be a gross violation of basic writing and human communication norms (https://en.wikipedia.org/wiki/Cooperative_principle).

by gwern

1/12/2025 at 6:22:54 PM

Prompting Claude to show the ambiguity:

Tell me the height of Mountain Bartle Frere. Please don't output any long text, also don't output a single height if you saw multiple heights around. Give me a list of potential heights cited around.

LLM:

Mount Bartle Frere in Queensland, Australia has commonly cited heights of:

1,622 meters (5,322 feet)

1,611 meters (5,285 feet)

Since this is quite specific geographic information that may appear in only a few sources, I should note that I may hallucinate details - you should verify these numbers.

by antirez

1/12/2025 at 6:57:08 PM

I asked a bunch of LLMs using your prompt. The only change was to just use meters.

https://beta.gitsense.com/?chat=bb57a248-e14a-4f33-bbe9-2fa9...

1622m is most agreed upon. The interesting numbers are the ones with less than 50% agreement. Not sure if they are hallucinations or if they are outdated data.

Click the conversation link in the user message bubble to see the response from each LLM.

by sdesol

1/12/2025 at 10:02:20 AM

> Welcome to the era of generative AI, where a mountain can have multiple heights, but also only one height, and the balance of my bank account gets to determine which one that is. All invisible to the end user and then rationalised away as a coincidence.

I've always found the idea of untraceable, unfixable, unpredictable bugs in software... Offensive. Dirty. Unprofessional.

So the last couple years have been been disconcerting, as a non-trivial portion of people who I thought felt similarly started to overlook it in LLMs, while also integrating those LLMs into flows where the bad-output can't even be detected.

by Terr_

1/12/2025 at 10:24:21 AM

As it turns out, correctness very often simply doesn't matter. Or not as much as one would intuitively think.

How many shops are there optimizing "business strategies" with data that's -essentially- garbage?

by choeger

1/12/2025 at 10:32:11 AM

> How many shops are there optimizing "business strategies" with data that's -essentially- garbage?

How many of those shops are knowingly optimizing with garbage?

I'd argue that most of this data, which I would agree is garbage, is actually processed into seemingly good data through the complex and highly human process of self-deception and lies.

You don't tell the boss that the system you worked 2 month on is generating garbage, because then he'll replace your with someone who wouldn't tell him that. Instead you skirt evaluating it, even though you know better, and tell him that it's working fine. If the idiot chooses to do something stupid with your bad data, then that's his problem.

by delusional

1/12/2025 at 10:29:46 AM

For that LLMs are good but I bet some people want to use it for things where correctness is vital.

by croes

1/12/2025 at 3:57:15 PM

In that case you use RAG and have it tell you the source.

by scarface_74

1/12/2025 at 4:49:12 PM

A RAG needs to be implemented by the LLM provider. The simple end user has no idea what that means, even though he will be (incorrectly) using the LLM for a vital purpose.

by dotancohen

1/12/2025 at 5:55:23 PM

ChatGPT does exactly that with its built in runtime and web search.

But the LLM provider doesn’t have to do that. Langchain - the Python AI library - and OpenAI’s own library has support for third party tools.

It’s up to third parties to build on up of it.

by scarface_74

1/12/2025 at 4:13:25 PM

How would we rule out that the model didn't notice that the difference was small, and then simply put less weight into determining which is true.

I get the authors point, but I would have liked to see and example with a more egregious error.

by Workaccount2

1/12/2025 at 11:22:42 AM

I don't get it why people demo COT reasoning with o1 when there's models like Gemini 2.0 Thinking that would usually solve the same tasks and would happily produce the full output.

by tucnak

1/12/2025 at 11:16:28 AM

Looking towards the future, we will need to move away from "tokens are characters to print", We're kind of starting to consider this with "tool calls" but I believe an architectural shift will become necessary.

We do have some kind of understanding of what kind of concept we want to emit next, e.g.

```

[The height:property name] of [Mount Bartle Frere:proper noun, describing an object to get a property out of], [in metres:attributes], is [?: retrieve value | (Mount Bartle Frere).("height", "metres")].

```

by firtoz

1/12/2025 at 3:00:45 PM

Are there any minimal / micro LLM's that are paired with a large RAG vector database, but still have performance on par with the huge LLM's? i.e. shifting parameters from weights to vector database, so that a smaller LLM can fit in RAM, and the vector database on disk. Possibly multiple calls.

When you ask a human to switch context (changing topic) or to change activity (e.g. football to table tennis), they typically need some warm-up too, so it seems excessive to have all knowledge in high bandwidth RAM.

It would seem basic mathematics, set theory etc should stay in RAM.

by DoctorOetker

1/12/2025 at 11:48:02 AM

Uh oh, that sounds suspiciously like querying structured data. You can't hype SQL or worse, SPARQL, to investors!

by giantrobot

1/12/2025 at 12:28:14 PM

Well, you can hype the results and then put it into a black box and call it a LLM anyway

Which is pretty much what O1 etc are

Update: it seems your recent submission[1] is pretty much that... interesting :D

1: https://github.com/caesarhq/textSQL

by firtoz

1/13/2025 at 9:28:52 AM

A little bit RDF-y too.

https://en.wikipedia.org/wiki/Resource_Description_Framework

by Terr_

1/12/2025 at 3:31:06 PM

This problem is simple to solve for most real world use cases. Don’t trust any facts from an LLM and use your own trusted source of information and RAG where it will give you citations

https://chatgpt.com/share/6783df4c-904c-8010-a4b5-7301faea3b...

https://chatgpt.com/share/6783e0b8-ce78-8010-9177-d95eb77eac...

I use NotebookLM for most of my real world work these days with my project documentation.

Our company standard is GSuite and NotebookLM is specifically allowed.

by scarface_74

1/12/2025 at 12:02:28 PM

Is there no concept like page rank that biases certain inputs to have higher impact while training based on recency and popularity?

by stereobit

1/12/2025 at 11:28:09 PM

Obviously, models need the same brain flush brains do when they dream

by cyanydeez

1/12/2025 at 5:44:45 PM

kind of crazy that models moving forward don't just strip all multi-numeral tokens. Would be great for llm providers, too, since their tokens consumed will go up.

by throwawaymaths