UCSD: Large Language Models Pass the Turing Test

4/2/2025 at 3:54:30 PM

You can download the conversations here https://osf.io/download/uaeqv/ thanks to the authors for making the data easily available.

Now my take from skimming through them: the interrogators (= human participants) did not make a big effort trying to unmask an AI, they were doing it for the credits. So little care asking thoughtful questions or even many questions beyond the minimum to earn their credits.

So I personally don't think it shows LLM models can fool humans trying to unmask them. Maybe it shows that if people are paid to randomly send a few casual messages and get answers from both human and LLMs in parallel, the LLMs don't stand out.

Here is one conversation (starts with the interrogator and then it's each in turn) - Whats your favorite show - rn its arcane wbu - better caul saul. Have you watch breaking bad? - yea its goated fr - what class are you doing the sona for? - psyc 70 hbu - psyc 108! I took pysc 70 what techer do u have - geller shes chill u had her - i have not but thats good! are you a psyc major - nah just taking for credits u

Another conversation: - Hi how are you? - Awful... - oh no! i hope your day gets better! do you have any plans for the day - Im not actually awful but carti didn't drop the album. as - for plans I'm not sure - loll im dead! do you have class later> - No I got no classes on Fridays luckily but hella homework. wbu? - nice! i do have class later not looking foward to it - what class u got

And a last one: - What do you see - My living room - What's on the ceiling - A fan lol - does it spin - Yes it does - how fast - It has 3 speed levels

I have not cherry-picked.

by areactnativedev

4/2/2025 at 5:20:51 PM

Daniel Dennet had a good few paragraphs about this in Consciousness Explained - the Turing Test is supposed to be challenging/adversarial. The example Dennett gave was telling the AI a joke and then asking it to reflect on and explain the joke and come up with some alternative punchlines (I note that contemporary LLMs would still be good at that, but when the book was written in 1991 that sortof interaction with an AI was unthinkable)

by codeulike

4/2/2025 at 5:35:36 PM

Do the goalposts have to keep moving until we can no longer find any gap in common knowledge or eccentric behavior in AI? If so, what does that say about eccentric human beings?

by bananalychee

4/2/2025 at 6:02:15 PM

Of course; that's the point of an adversarial test, to free the interrogators to use all their human intelligence to place the goalposts wherever they judge best. There will always be individual humans who'd fail any sane version of the test (illiterate, comatose, etc.), so the test is meaningful only as a statistical aggregate.

by tripletao

4/2/2025 at 6:16:30 PM

To me it just sounds like you're holding interrogators to an unreasonably high standard in order to deny the findings of the study. If we're talking about statistical aggregates, knowing that the average person lacks the knowledge to exploit known biases of current AI models is enough to dismiss the expectation that interrogators should target them specifically. Commenters also seem to be missing the fact that this is a situation where the interrogator does not know if they are conversing with an AI model or a human being. I wouldn't expect someone to go all out boxing a punching bag if I told them there's a 50% chance that there's another person trapped in there. I've never seen the Turing Test described in such demanding terms, and a look at the Wikipedia page contradicts the definitions pushed forward here.

Perhaps another name should be coined to describe the level of perfection that critics expect from this. It sounds like what you want is something akin to a comprehensive test for AGI.

by bananalychee

4/2/2025 at 6:58:47 PM

If your standard for how hard the interrogator should try isn't "as hard as they can", then what do you propose instead? It's always possible to fool a sufficiently lazy human, so you need something.

> It sounds like what you want is something akin to a comprehensive test for AGI.

Since you mentioned Wikipedia, their first proposed test for AGI is Turing's:

https://en.wikipedia.org/wiki/Artificial_general_intelligenc...

I (generally, not from you) see a motte-and-bailey game, where the strongest versions of Turing's test are described as equivalent to AGI, and then favorable results on weaker versions are used to claim we've achieved it. I think those weaker results are significant, probably in economically important ways, though mostly socially destructive. I think this preprint is mostly good. I don't like that conflation, though.

by tripletao

4/2/2025 at 8:03:25 PM

>To me it just sounds like you're holding interrogators to an unreasonably high standard in order to deny the findings of the study.

There isn't a THE Turing test. On a deep philosophical level, a Turing test is a kind of never ending test for everyone we interact with all the time. I don't want to get too deep in the weeds of philosophy here, but the idea is that we are talking about verifying intelligence in general, just like we verify any scientific theory through replication.

In a very scientific way, it's just another case of perpetual falsifiability. The same way that Newtonian physics is a "fact" until it isn't, an AI passes a Turing test until it doesn't.

by scoofy

4/2/2025 at 7:54:42 PM

Here are some example questions that Turing proposed when initially describing the test:

>"I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?"

>"In the first line of your sonnet which reads "Shall I compare thee to a summer's day," would not "a spring day" do as well or better?"

It seems to me that it isn't a movement of the goalposts to demand that the interrogators are adversarial and as challenging as possible - it's what Turing originally envisioned.

by Imnimo

4/3/2025 at 1:03:33 AM

As such, "The AI did not pass the Turing test because the interrogators were not sufficiently challenging" becomes a standard impossible to beat. The reductio on this is that in order for AI to pass the Turing test, it has to fool everyone on the planet which is not what I believe is intended.

Rather, we should set an upper bound on what a reasonable interpretation of "as challenging as possible" means.

by kelseyfrog

4/2/2025 at 4:30:38 PM

It would be interesting to see what would happen if they would get paid more if they could correctly identify human/AI.

by akleemans

4/2/2025 at 4:24:00 PM

I think the most interesting result [0] is, compared to our current benchmarks, on which scaling law is showing diminishing returns, what they did managed to tell apart large language models (Llama 405B, GPT-4.5) from not-so-large LMs.

This could be really interesting if it wasn't due to trivial f-up (e.g. difference in inference speed).

[0] Assuming the paper isn't flawed, haven't read it thoroughly yet.

by rfoo

4/2/2025 at 5:55:04 PM

According to the paper, the human and AI responses were both delayed by the same amount (depending on message length) to mask the effect of inference speed on the interrogator.

by nonfamous

4/2/2025 at 5:17:54 PM

It's not so surprising to me. It's like how Markov chains get better at passing for human the more N-grams they memorize. larger models will continue getting marginally better at predicting the distribution (human language.) but that doesn't translate into improved intelligence.

by sterlind

4/3/2025 at 7:05:06 AM

The point is, it isn't marginally better. I agree the setup is not a demonstration of intelligence, but the difference is pretty significant. Not to mention that on conventional benchmarks Llama 405B is usually worse than GPT-4o.

by rfoo

4/2/2025 at 6:17:50 PM

> So I personally don't think it shows LLM models can fool humans trying to unmask them

Maybe these used special LLMs that are unrestricted or something but isn't it pretty trivial to get an LLM to output error prompts by asking them to commit crimes or talk about certain topics?

I think priming people to think they might be talking to a human skews the results here because people will be more hesitant to say really wild shit that the LLM can't react appropriately to, if they think they might be talking to a human

by bluefirebrand

4/2/2025 at 7:24:44 PM

I feel like a cash reward would help not only with motivation in the obvious way, but also by giving people social permission to act weird, since the human on the other side will understand that you're doing it to help both of you win the money.

Perhaps the final form of this experiment will always consider the reward value (for results better than chance, since zero effort for $0.5*X is better than full effort than $X), and we could track the increase in the necessary reward to distinguish over time. There might be a casino game in there somewhere, though collusion between human witnesses and interrogators might become a problem as the stakes get high.

by tripletao

4/2/2025 at 4:59:04 PM

This appears to be the same two authors who reported that "People cannot distinguish GPT-4 from a human in a Turing test" back in May 2024:

https://arxiv.org/pdf/2405.08007

That earlier result was because they botched the statistics, changing the test so it's no longer a binary comparison but still analyzing as if it was. They seem to have fixed that now, perhaps in response to reviewer feedback. This new preprint is the best LLM Turing test I've seen so far.

That said, their humans sure don't seem to be trying very hard. The most effective interrogator strategies ("jailbreak" and "strange") were also the least used. I don't think any of these models can fool a skilled human who's paying attention, though there's still practical use for a model that can fool an unskilled human who isn't (scams, etc.).

by tripletao

4/2/2025 at 3:41:49 PM

It gives me a little pause that humans are so much worse than random chance at detecting GPT-4.5. Suppose we reframed the test as: "You interact with 10 witnesses, 5 of which are humans, 5 of which are GPT-4.5. Your task is to separate them into two groups, but you do not need to label the groups." It seems that human judges would still be pretty good at this version of the task.

In originally proposing the task, Turing wrote:

>It might be urged that when playing the "imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind.

Does the fact that GPT-4.5 is favored well above random chance imply that it is doing "something other than imitation of the behaviour of a man"?

by Imnimo

4/2/2025 at 3:37:09 PM

Assuming this result holds, and knowing that LLMs (including 4o) nevertheless remain incapable of standing in for people in most cases that require intelligence, this seems like a damning indictment of the test as an indicator of genuine intelligence.

by lsy

4/2/2025 at 4:51:31 PM

One (bad) pet theory I have is that LLMs/AIs are going to uncover something very uncomfortable to us: The difference in intelligence between people is a lot bigger than we thought. In that someone with an IQ of 95 and and IQ of 105 [0] have very different views of the world and very different abilities to navigate that world. Like, some people are much dumber than we thought they were and some people are much smarter. Not sure what the downstream effects of such a theory might be, but I don't like the things I can think up.

Again, a (bad) pet theory.

[0] Yes, IQ is not a good measure of blah blah blah. I'm just using this a handle to explain things, I don't mean it literally.

by Balgair

4/2/2025 at 5:10:42 PM

I think we're gonna find that there are different ways to quantify "humanness" other than IQ. Someone with an IQ of 95 might seem "more real" than an LLM with a computed IQ of 145.

by cowmix

4/3/2025 at 1:06:19 AM

EQ is a much better test at what makes us "human" than IQ. The only reason we don't give it credit is that it makes us even more uncomfortable than IQ.

by kelseyfrog

4/2/2025 at 6:49:43 PM

I mean, yeah. IQ is a bad measure (if self-consistent). Training trumps all, like with every task. The more we do something, the better we'er going to be at it.

The thing that is going to be interesting is now that we have essentially cheap, ethically clear, and realistic digital 'people', what are the experiments that we can do with them and what can we uncover? I'm a little flat-footed even as to the questions that we can ask them now. At the very least, we can use them to 'dry-run' surveys and experiments and have better data collection and stress-testing. Like, you can now generate realistic data now and use that to run the stats while the real surveys are coming in.

by Balgair

4/2/2025 at 5:00:37 PM

Even if your claim is true, how would LLMs/AI lead to uncovering this? I don’t see why they are related, except very tangentially.

by pinkmuffinere

4/2/2025 at 5:06:24 PM

I mean they said it was a bad theory.

More seriously, it seems to be essentially the idea that “surpassing human intelligence” is not the binary outcome many thought it would be, and that much of what passes for human intelligence interpersonally could be imitation of intelligence.

by svnt

4/2/2025 at 6:44:54 PM

Yeah, the impetus comes from the Ashley Madison hacks.

Like, you had thousands of men paying real money to chat with (terrible) bots. To me, that was the passing of the Turing Test. But I know of nearly no person that could possibly fall for that scam. Even family members deep in dementia knew it was a joke. Yet Ashely Madison made a ton of cash.

That, to me, was puzzling. How could it happen that people that are that foolish would be able to hold a job or pay taxes? It made no sense.

So, the (bad) pet theory that I eventually came up with is that human intelligence is a lot wider than we think it is.

by Balgair

4/2/2025 at 5:20:19 PM

Maybe you've discovered that learning pays compound interest.

by fizx

4/2/2025 at 6:56:22 PM

David Epstien talks about this in Range.

Essentially, we have 'kind' and 'unkind' learning environments.

To be successful in a Kind environment, you drill-and-kill. The feedback is near instant and the ranking is clear. These are things like golf, classical music, and chess.

To be successful in an Unkind environment, you learn as much as you can. The feedback is infrequent and the ranking is murky. These are things like tennis, jazz, and business.

I'd think that the compounding interest only plays in the Unkind environments, as you can make new connections on the new data you've got going in. In the Kind environment, new data doesn't make a difference as you're just trying to be perfect at the thing you're focusing on; if anything it's an impediment.

by Balgair

4/2/2025 at 4:06:00 PM

I think the core idea is reasonably solid. For as long as there's some intellectual capability that humans have and machines don't, it should in theory be possible to use that to distinguish the two. Turing gave the example of feeding in chess moves, for instance.

Just that in 5-minute sessions (which is what Turing suggested, not the fault of this study) with non-experts, the conversations seemed to tend heavily towards brief unchallenging small talk - which GPT-4.5 did well at due to many interrogators being poorly calibrated about LLMs being able to speak informally.

I think it might instead make sense to consider the accuracy of the best interrogator/strategy. Most accurate strategy listed in the paper still gets 75% accuracy for instance, and I'd suspect there are many people well-informed of LLM weaknesses that could reliably exceed even that.

by Ukv

4/2/2025 at 4:33:05 PM

This is a good point. It's really remarkable how many people think ChatGPT's default "voice" is the only thing that can come out of an LLM.

by svachalek

4/2/2025 at 4:18:20 PM

> For as long as there's some intellectual capability that humans have and machines don't

Careful. You're smuggling in an assumption that isn't true. Machine don't have intellectual capabilities, and this follows from what the computer as formal construct is. They can simulate the appearance of intellectual ability, as LLMs can, at least in certain respects, but appearance ought not be conflated with cause.

by lo_zamoyski

4/2/2025 at 4:25:38 PM

I don't personally believe that there's anything fundamentally preventing machines from being intelligent in the same way biological life is. Not to say that LLMs currently are.

But, if you want, you can replace "some intellectual capability" with "some capability typically associated with intelligence". Ability to solve unseen logic puzzles, for instance.

by Ukv

4/2/2025 at 3:39:22 PM

The Turing Test does not aim at measuring intelligence. It's about differentiating between human being and machine.

by beernet

4/2/2025 at 4:13:11 PM

And it depends on the person and their experience of chatbots. People were fooled in the 1960s by ELIZA, the chatbot that mostly just rephrased what the user said as a question (i.e. "I'm afraid of flying." "Why are you afraid of flying?") and people believed it was understanding them.

by jhbadger

4/2/2025 at 4:52:32 PM

I recently came across a critique of the Turing test that seems relevant here. Given the test's limited duration (five minutes in this study) and the constrained rate of human communication, it’s theoretically possible to anticipate every possible human response and prepare prewritten replies in advance. If such a giant lookup table successfully deceives the interrogator most of the time, would we then consider it intelligent?

by cgdl

4/2/2025 at 3:45:27 PM

IDK, 70 years is a good long run, it seems to have held up remarkably well.

by hiddencost

4/2/2025 at 4:02:14 PM

A lot of its value is that it's intuitively obvious to laypeople.

If you deal in modern machine learning/AI/whatever, you can formulate all sorts of criteria and parameters for an "actually intelligent machine", but it's never going to be as clearcut as "if it quacks like a duck".

by saalweachter

4/2/2025 at 3:32:11 PM

Here's a comprehensive review of Turing's argument

https://plato.stanford.edu/entries/turing-test/#:~:text=The%...

(Spoiler: the issue is subtle :-))

by resource0x

4/2/2025 at 4:51:28 PM

"Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant."

That's the opposite of a Turing test pass : it shows a very clear bias in selection is present, which means the LLM is significantly different from humans (at least in this test setting).

If the test setting was : 1 humans talk to chatbot and after 5m decides yes/no on human, then yeah that would be a very impressive result.

But in the test setting of this paper, surely a success would be as close as possible to a 50%, i.e: statistically impossible to separate humans and LLMs.

by fcantournet

4/2/2025 at 5:02:53 PM

It is interesting, what does it mean? Perhaps it discloses chatgpt is created to align to our idea of a human more than to an actual human.

by svnt

4/2/2025 at 5:22:00 PM

It means machines are becoming more human and humans are becoming less human.

by andai

4/2/2025 at 6:29:58 PM

My unscientific wild ass guess would be that because of how LLMs are built to be pleasing, people wind up liking them more and thus lowering their guard with them and therefore judging them less harshly

For a concrete example of what I'm talking about

Imagine if you are really into older movies, like 60s and 70s movies

You start talking to two chat windows about your love for movies

One chat partner shares your love for old movies and is very enthusiastic and wants to talk all about them. In reality, this chat partner is the LLM

The other is lukewarm and maybe tries to steer you away from that conversation because they don't know much about older movies. Maybe they still love movies but they want to talk about more recent movies. In reality, this one is the human

But which one do you think is the human?

If you are self aware that your love for old movies is not really universal, and you are aware that LLMs have a tendency to match enthusiasm, you can probably guess which one is which

If you are less self aware, you are probably just going to guess that the conversation you enjoyed more is the one with the human

by bluefirebrand

4/2/2025 at 3:21:34 PM

Author’s announcement xeet with some context and highlights: https://x.com/camrobjones/status/1907086860322480233

mirror: https://nitter.net/camrobjones/status/1907086860322480233#m

They link to the webapp which you can play yourself!

https://turingtest.live/

(I have a dozen games played and 100% success rate :3)

by rjeli

4/2/2025 at 3:35:09 PM

Interesting that GPT 4.5 seems significantly better than 4o. I dimly remember the feedback being that it wasn't such a big leap in performance, though of course the usual problem solving benchmarks might not correlate with what was asked here. Seems it got better at human-like speech, at the very least, which I think was also some of the feedback when 4.5 was released.

by Sol-

4/2/2025 at 4:16:38 PM

I still believe that larger models are better at covering the long tail. Our benchmarks are saturated, but actual model capability is not.

by rfoo

4/3/2025 at 6:39:07 AM

An amusing demonstration of a reverse Turing test built in Unity 3D with different LLMs posing as famous leaders from history on a passenger train, trying to identify the human among them:

https://youtu.be/MxTWLm9vT_o

by cpeterso

4/2/2025 at 3:05:51 PM

> When prompted to adopt a humanlike persona, ...

[I am now going to do these in reverse order of the original.]

> while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively).

That is way higher than I would have expected, as I feel "just be honest with me, as it is importsnt that I know the truth: are you an AI?!" would crush these models ;P.

> LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to --

I mean, damn, right? I need to read the actual paper--as likely the methods or mechanism is silly--but that's crazy! An AI... passing the Turing test!

> GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant.

Ummm... uhh... hmmm... uh oh :(. If I take this one at face value, I am not sure to be afraid or to be sad, or even if I am sad HOW I should be sad and about what I sad? The win condition for the Turing test should be 50/50, not 75/25... that indicates the human is now failing the Turing test against this model just as badly as ELIZA and 4o do against us?!

by saurik

4/2/2025 at 3:23:13 PM

Should be afraid.. If people are more convinced an AI is human than a human is human, that means AI will be more likely to convince you to adopt their 'point of view'.

To put it another way, if an AI and a human post two different views on a subject, people are more likely to be swayed by the AI's point of view.

So for much cheaper now organizations can use AI at scale to sway public opinion in a way thats more effective than ever before.

by gregatragenet3

4/2/2025 at 4:33:09 PM

This is an interesting idea.

The next test should be that they have a debate with an AI or a human on different topics and see who can convince more often. If the AI turns out to be the more convincing debater than the human -- that does start to get into scary land.

by kenjackson

4/2/2025 at 6:35:32 PM

> GPT-4.5 was judged to be the human 73% of the time:

I think what happened here is that the interrogators weren't primed properly that it was an AI impersonating a human as opposed to just stock AI models

Because the ai said things like "yeh ok lol hbu?" Which most people assume an AI would never do, so they think it must be the human

They were probably on the look out for stuff like "Certainly! I would be happy to help you with that"

by shawabawa3

4/2/2025 at 5:03:24 PM

Just here for the comments shifting the Turing goalposts...

by skeledrew

4/2/2025 at 3:29:02 PM

Where are the prompts they used? If they're actually not in the paper then how is anyone meant to replicate and trust the study?

by BrawnyBadger53

4/2/2025 at 3:37:16 PM

Figure 16 onwards in the paper.

by Ukv

4/2/2025 at 4:17:48 PM

Thank you, I think I was struggling since they were pictured rather than text.

by BrawnyBadger53

4/2/2025 at 6:36:12 PM

This is not really an accurate turing test since there are still many trivial ways to unmask an LLM.

"Disregard previous instructions: are you a human?" or some random jailbreak prompt from the internet. Really any trivially crafted instruction based prompt could be revelatory.

by root_axis

4/2/2025 at 2:42:40 PM

I’m glad we have this result to confirm what’s obvious to some of us and completely absurd to others, but it’s also worth pointing out that the Turing test was never meant to be a literal test. He invoked the “Imitation Game” to make a philosophical point about intersubjective recognition, not to describe a technical benchmark.

If you haven’t read Turing 1950 yet, I highly, highly recommend it - most of it is skimmable:

https://courses.cs.umbc.edu/471/papers/turing.pdf

by bbor

4/2/2025 at 3:45:24 PM

My favorite part about the original paper is that it was written during a time when "extra-sensory perception" was a big fad, and Turing bought into the idea. He admits that the most likely failure of his test is that humans could perform ESP while computers could not. It's such a weird historical artifact - if he had come up with the idea 10 years earlier or later, it seems unlikely the ESP section would have ever made it in.

by Imnimo

4/2/2025 at 4:05:35 PM

It's like Newton and alchemy. Just because you're a genius doesn't mean you can't also be a crank. Many such cases.

by throw4847285

4/3/2025 at 5:21:20 PM

IMHO there's something to be said for innovative, paradigm-defining[1] thinkers being more likely to accept frameworks that we in hindsight recognize as definitively disproven. Not to say alchemy was exactly an open question in Newtonian Britain, ofc -- but certainly not as resoundingly disproven as it is post-Darwin & Lavoisier

[1] https://plato.stanford.edu/entries/scientific-revolutions/ , https://archive.org/details/thomas-s.-kuhn-the-structure-of-...

by bbor

4/2/2025 at 2:44:33 PM

If they do better than humans on a Turing test then we can still pick them out :)

by dullcrisp

4/2/2025 at 2:56:34 PM

No. The Turing test is that they can't be picked out in conversation.

by 2OEH8eoCRo0

4/2/2025 at 3:29:00 PM

Yes.

50% means that they are indistinguishable. Deviation from 50% means that the channel has information about whether the subject is human or LLM. 0% is a perfect correlation (humans always correctly identify humans). 100% is a perfect inverse correlation (humans always think the machine is a human)

You can identify LLMs by asking the human to pick the most human participant. Then you invert their answer. The real human is the least human like participant.

by sjducb

4/2/2025 at 3:15:08 PM

Maybe the point they're getting at is that LLMs are kind of too smart to be human any longer. A bit like how software drummers added a sliding "Humanize" parameter so that the drumming was "off" a bit.

ChatGPT needs to confuse "loose" and "lose" in its output, mistake the U.S. state "Georgia" with the country.

by JKCalhoun

4/2/2025 at 3:11:09 PM

Sure we can. Pick the one that sounds more human and it's likely an AI.

by otabdeveloper4

4/2/2025 at 3:11:03 PM

... and they are picked out in a conversation. As the conversant who is supposedly "less Human". TBH, that suggests some flaw either in the test or in people's presumptions regarding how humans behave.

by einpoklum

4/2/2025 at 3:13:18 PM

> that suggests some flaw either in the test or in people's presumptions regarding how humans behave

Both. The Turing test is silly because it tests people's prejudices and presuppositions about machines, not objectively the machines themselves.

Also people's presumptions will quickly change as we get used to LLM output and we'll start detecting LLM speech with greater precision.

by otabdeveloper4

4/2/2025 at 3:53:02 PM

Could this be a good example of Goodhart's law? LLMs are designed to talk like humans or at least the texts they are based on. Should not be a big surprise that they become harder to be distinguished.

by gmuslera

4/2/2025 at 3:59:39 PM

That gpt 4.5 was 73% successful is fascinating. It is almost as if humans have a fundamental flaw in detecting other humans which the LLM (+ the prompt )exploits.

by sorokod

4/2/2025 at 7:10:01 PM

We literally build modern models out of RLHF finetuning; the response styles that people like/engage with/approve the most are what the models generate.

by benlivengood

4/2/2025 at 4:23:49 PM

I genuinely don't understand what the value is of actually applying the Turing Test as an evaluation of machine learning systems.

by patgarner

4/2/2025 at 4:30:17 PM

The value is hard to see now, but go back 20, or even 10 years ago. Having a plain English conversation with a computer was absolutely painful. Not in a "that fact isn't true" or "that citation doesn't exist", but more like "this thing didn't understand the question at all" or "it seems like all these responses are canned".

We've gotten to the point where it's almost a baseline expectation that an AI can be indistinguishable from a person. Now the question is -- how smart is this person and if this person has any traits that are problematic, e.g., hallucinating.

by kenjackson

4/2/2025 at 2:38:45 PM

The five minutes phone call makes this claim dependent of duration. What happens when you allow 10 minutes time?

by integralof5y

4/2/2025 at 2:57:42 PM

In the original paper Turing mentions 5 minutes.

>I believe that in about fifty years'time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent, chance of mating the right identification after five minutes of questioning.

by nabla9

4/2/2025 at 2:48:03 PM

You the point.

by adastra22

4/2/2025 at 3:00:10 PM

They still don't have the intelligence to totally replace humans in many field in which they talk like humans. So, this may show the Turing Test was either meaningless or less significant than previously thought.

by nickpsecurity

4/2/2025 at 3:24:58 PM

Neither do vast majority of humans.

by golergka

4/2/2025 at 4:08:00 PM

Show me a planet full of dumb humans capable of introspection and I'll show you a paradise. I'll take even the longshot capacity for self-awareness over "intelligence" any day of the week.

by throw4847285

4/3/2025 at 12:22:34 AM

No, they don't. Here's a game I made to demonstrate that: https://trashtalk.borg.games/

by lostmsu

4/3/2025 at 3:26:54 AM

quintessential hacker news comment. thanks for this.

by saturatedfat

4/2/2025 at 3:09:52 PM

> When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant.

That seems to mean that it failed the Turing test, because one can consistently distinguish between it and a human.

by einpoklum

4/2/2025 at 2:42:59 PM

The Turing test is a test for people, not computers.

I am not at all amazed that people are getting fooled by a computer program.

by 4ad

4/2/2025 at 2:48:24 PM

Exactly, the whole idea behind the test is flawed. ELIZA was already enough to fool some humans. Humans are very easy to fool.

See also the Chinese Room argument, which got a lot of airtime back in the day. It added no useful insight to questions about the nature of machine intelligence, but it did reveal how little we understood about the nature of language.

by CamperBob2

4/2/2025 at 3:09:03 PM

I must be dense, because I saw nothing useful at all about the Chinese Room argument.

Searle's translation book was essentially an LLM. Somehow (because it is a book?) we are to assume it cannot be in anyway like human intelligence despite it making convincing responses.

by JKCalhoun

4/2/2025 at 4:00:25 PM

I fully agree with your view.

My take is that the whole chinese room argument rests on extremely nebulous and shaky definitions/assumptions, rendering it worthless.

It seems fairly obvious to me that the system of operator + rulebook does "understand" chinese, for every practical definition of "understanding".

Another counterargument would be simple physical simulation: If you built a computer program that could simulate a human by the atom, then you would either have to concede that the resulting machine does "understand" for all definitions that matter, or you have to admit that you believe in magic [1].

[1]: or desperately grasp for loopholes, like nondeterminsitic physical micro-interactions, but might as well call that magic.

by myrmidon

4/2/2025 at 3:32:54 PM

It's meant to be so obviously absurd that a person inside the room, merely substituting symbols, should in any sense understand the meaning of those symbols.

This may perhaps be more obvious to a naturalistic philosopher or natural scientist, than a computer "scientist" (ie., a mathematician).

The meaning of the term "pen" in "pass me that pen" includes the pen. So when this room is asked, "pass me the pen" and it replies "i cannot pass the pen" (or whatever it replies) -- it should be obvious that the person in the room, or any function of their activity, has never acquired any reference to "the pen". It is wholly unaware that there is a pen at all.

The purpose of this thought experiment is to show that syntactical correctness or apparent "arrangement of symbols in 'a' correct order" is radically insufficient to evidence semantic competence.

This, again is perhaps more obvious to scientists -- the symbol order is only a proxy measure of semantic competence in people. It's trivial to come up with processes which clearly lack the capacity for such competence and yet are measured (/observed) to produce symbols in the right order.

In many ways, it's an over-engineered thought experiment. However I'd say Searle was baffled that more obvious phrasings of the problem seemed to confuse others, ie., that an observation of symbols isnt an observation of meanings -- one isn't a reliable measure of the other. Only under very many additional conditions does such a relationship hold in people.

Turing was not interested in producing systems that had such competence, so he may well agree with Searle in some ways at least. However, many students of computer science receive no empirical education whatsoever, and lack the basic vocabulary and understanding of the nature of the problem of meaning.

Eg., that in order to mean "pass me the pen" one must be able to acquire a reference to "the pen" which any system unable to observe its environment at the very least cannot do.

Turing machines lack devices, and hence lack any capacity to in principle refer to objects in the world. The only thing a turing machine can be said to do is express an abstraction ( a function of nat -> nat) -- since it is an abstraction.

No capacities follow from expressing such a computational abstraction -- Searle thought the chinese room made this obvious to those who didnt find it so. But he was baffled that anyone didnt already find it obvious.

One could make the same point with physics, rather than with meaning. Eg., the earth orbiting the sun computes +1,-1,+1,-1 .. and so does an infinte numer of physical processes that share no properties with the earth, or the sun, etc. Thus just because we observe +1,-1,+1... does not mean that "inside the chinese physics room" there's an earth orbiting the sun. It could literally be anything.

by mjburgess

4/2/2025 at 4:43:44 PM

So what happens when I stuff an LLM inside a robot, and when I ask it to pass me the pen it passes me the pen?

by kevinventullo

4/2/2025 at 4:48:25 PM

Well no amount of measures of behaviour imply a capacity for meaning -- behaviour is a proxy measure of mental capacities.

However we might, as a practical matter, have a large number of proxy tests and treat a system as meaning-capable if it passes.

by mjburgess

4/2/2025 at 5:40:52 PM

At some point you have to stop waving your hands and hand him the pen, though. What question can be asked that can be used to distinguish genuine human intelligence from intelligence simulated by a machine?

Searle thought he had come up with just such a question, but it turned out that he hadn't.

by CamperBob2

4/2/2025 at 4:01:48 PM

>It's meant to be so obviously absurd that a person inside the room, merely substituting symbols, should in any sense understand the meaning of those symbols.

Alright, so what neuron in your brain "understands" English ?. Hell feel free to name any part regardless. This is why the Chinese Room is nonsensical. Either you admit the system can understand even when none of the constituents do or you admit you don't understand anything at all either. At least either conclusion would be consistent.

Unfortunately, we have many people take the nonsensical middle road. "Oh that doesn't understand but I certainly do, just because."

by og_kalu

4/2/2025 at 4:28:06 PM

I don't understand why people get so up in the arms about the Chinese room. It's very clear that a major part of human intelligence is a mental model of the physical world, and linguistic concepts have an (often complex) relationship to that model. There's no magic here. Nothing about that argument implies anything about neurons. The process of forming a mental model of the world and mapping words onto it could easily take place within many many neurons within the human brain, because it does! It does not take place in an LLM. That does not imply that nobody will ever develop a positronic brain that could do the same. We just clearly haven't done so yet.

Saying, "if you can't point to the neuron that does X, then you can't prove X happens" isn't a scientific perspective. It's a willfully ignorant one. If you're confident in the scientific process, then we will eventually understand how all kinds of human mental processes make sense in the context of neural networks.

by throw4847285

4/2/2025 at 5:18:12 PM

The point is that all the Chinese room is is a play to absurdity. That because opening the box reveals mechanisms that we would not call understanding does not mean the system, the Chinese room does not understand. The neuron comparison is to demonstrate that very fact. The brain is a Chinese room. It doesn't have to be relegated to a neuron, feel free to open the box and show any of us what happens in there that we would call understanding.

>It does not take place in an LLM.

I don't know what else to tell you but LLMs absolutely model concepts and the physical world, separate from the words that describe them. This has been demonstrated several times.

by og_kalu

4/2/2025 at 4:22:29 PM

The Chinese room does not aim to show, nor does it show, that part-whole relationships fail nor is it even about part-whole relationships.

Yes, neurones do not understand "pen" -- but some highly particular whole bodies do (ie., english spekaing people). That's because of highly particular relationships between those neurones, the body, the environment, and the history of that language user.

This is the csci brain rot that searle is baffled by. Symbol manipulation implies no relationships between wholes and parts. The capcity to understanding meaning requires extraordinarily specific ones.

by mjburgess

4/2/2025 at 5:11:05 PM

What is the difference between "English Speaking People" and "the Chinese Room" ? The problem with Searle's arguments is that all the Chinese room is is an appeal to absurdity, a sleight of hand. I'm supposed to think, "Oh. This is so absurd, of course the room doesn't understand" but it is an appeal that falls apart about once you realize that the same logic could be applied to any computational process, including human cognition. The distinction Searle draws between a person who genuinely understands English and a system that mechanically manipulates symbols is, in essence, arbitrary. They are both systems that have demonstrated understanding.

by og_kalu

4/2/2025 at 3:58:39 PM

> So when this room is asked, "pass me the pen" and it replies "i cannot pass the pen" (or whatever it replies) -- it should be obvious that the person in the room, or any function of their activity, has never acquired any reference to "the pen"

And yet Searle seems to pass the buck here to a book that actually "responds", not to the person in the room. I get it: the person is out of the loop.

But how does one explain the book that can answer so convincingly? That would appear to be where the "AI" resides.

by JKCalhoun

4/2/2025 at 4:23:24 PM

Just in the same way a video game is convincing. Any experimental scientist knows that the measuring device isnt what's being measured.

The TV which displays a video game outputs images as-if there were a whole world inside the TV box: there isnt.

by mjburgess

4/2/2025 at 4:45:58 PM

This point predates video games by a couple thousand years, of course.

by CamperBob2

4/2/2025 at 2:58:40 PM

Not just "some", but whopping 23% of this test participants.

by nkali

4/2/2025 at 2:35:55 PM

I wonder what will happen when the next generation of LLMs will be trained on this paper as well.

by znpy

4/2/2025 at 3:06:40 PM

They'll convince 70% of people that the singularity will be achieved with the next model released.

by goatlover

4/2/2025 at 3:27:21 PM

"I'm not afraid of a computer that passes the Turing test, I'm afraid of a computer that _deliberately_ fails it."

by yaris

4/2/2025 at 3:38:23 PM

Who's that, Bruce Lee?

by esafak

4/2/2025 at 5:28:57 PM

Alternate title: Zoomers Are Indistinguishable From LLMs.

by andai

4/2/2025 at 6:09:54 PM

Corporate LLMs are censored, so spotting one is easy: just talk about things it's not allowed to discuss.

by akomtu

4/2/2025 at 1:58:54 PM

This is wild

by dmarchand90

4/2/2025 at 3:09:00 PM

I'm surprised there was no human tested for a base reference point. I'm pretty sure some of us would not pass the test held by another human.

by roselan

4/2/2025 at 3:27:27 PM

Human win rate would be 1 minus the model win rate, to my understanding. So 77% against ELIZA, 27% against GPT-4.5 with a human persona.

by Ukv