
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time—not significantly more or less often than the humans they were being compared to—while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). **The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test.** The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
(emphasis mine)
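The "significantly" claims compare each win rate against the 50% an interrogator guessing at random would achieve. For intuition, here's a minimal sketch of such a check in Python, assuming a placeholder sample size (the preprint reports the actual judgment counts):

```python
# Is a 73% "judged human" rate significantly above the 50% expected from
# an interrogator guessing at random? Two-sided binomial test via scipy.
# n_trials is a placeholder, NOT the study's actual number of judgments.
from scipy.stats import binomtest

n_trials = 100  # hypothetical number of interrogator judgments
n_wins = 73     # GPT-4.5's win rate under the humanlike persona prompt

result = binomtest(n_wins, n_trials, p=0.5, alternative="two-sided")
print(f"win rate = {n_wins / n_trials:.0%}, p = {result.pvalue:.2g}")
```

The same test explains why LLaMa-3.1's 56% can come out "not significantly different" from the humans: how far a rate must deviate from 50% to reach significance depends on the number of judgments.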
As I checked for a published version (there isn't one yet), I stumbled on this general-audience article on the topic, in case the hedged language of the arXiv preprint isn't your thing.
The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test
Wow, really surprised it hadn't been done before.
I've long thought we were past the Turing Test. Now we just need a new definition of AGI.
reply
Surprising indeed. I thought it'd have been one of the obvious benchmarks people would try. I guess they did test it, but this is the first time a system has passed.
reply
On the contrary: a “small” language model (by which we generally mean one with on the order of millions, rather than billions, of parameters) struggles to mimic human conversation convincingly for several interlocking reasons (see the rough parameter-count sketch after this list):
  1. Limited Representational Capacity
Fewer parameters mean the model can store and manipulate far less information about language patterns, world knowledge, and subtle linguistic nuances.
As a result, it often resorts to simplistic or repetitive responses, rather than the rich variety of expression a human would use.
  2. Poor Generalization and Coherence
Small models tend to overfit to the specific data they were trained on, so they struggle with novel topics or unexpected turns in conversation.
They lack the depth to maintain a consistent persona, long-term context, or coherent thread over multiple turns, making their dialogue feel disjointed or “robotic.”
  3. Limited World Knowledge and Reasoning
Passing a Turing test usually requires not just fluent language but also common-sense reasoning, up‑to‑date facts, and the ability to draw inferences.
With constrained capacity, small models cannot internalize large-scale factual databases or sophisticated reasoning patterns; they often hallucinate or give incorrect answers when pressed.
  4. Surface‑Level Pattern Matching
At their core, small LMs are powerful pattern‑matchers but lack the deeper latent structures (e.g., causal models, theory of mind) that larger models can approximate.
This leads to responses that may look grammatically correct but fail to capture intentions, emotions, or the pragmatic subtleties of human dialogue.
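To make the “millions rather than billions” contrast concrete, here's a back-of-envelope parameter count for a transformer, a minimal sketch using the common ~12 · n_layers · d_model² rule of thumb plus the embedding matrix. Both configurations are illustrative; the large one is merely sized to land near a 405B-class model, not taken from any official spec:

```python
# Rough transformer parameter count: ~12 * n_layers * d_model^2 for the
# attention + MLP blocks, plus vocab_size * d_model for token embeddings.
# Both configs are illustrative, not copied from any published model card.
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    blocks = 12 * n_layers * d_model ** 2  # weights in the transformer blocks
    embeddings = vocab_size * d_model      # token embedding matrix
    return blocks + embeddings

small = approx_params(n_layers=6, d_model=512, vocab_size=32_000)
large = approx_params(n_layers=126, d_model=16_384, vocab_size=128_000)
print(f"small: {small / 1e6:.1f}M parameters")  # ~35M
print(f"large: {large / 1e9:.1f}B parameters")  # ~408B
```

Roughly four orders of magnitude separate the two budgets, which is the gap points 1-3 rely on.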
Implications for the Turing Test
Alan Turing’s original proposal envisioned an interlocutor capable of sustained, varied, and contextually appropriate conversation. Small language models simply don’t have the “brain‑like” resources—be it memory, breadth of knowledge, or reasoning scaffolds—to convincingly impersonate a human over an extended exchange. In short, they lack both the scale and depth required to fool a well‑informed judge.
reply
I prefer original thoughts.
reply
Sure, we all do, but AI is taking over the world courtesy of human knowledge, because we actually programmed the AI to function the way it does. 🤷‍♂️
reply
What do you get, personally, out of copy-pasting this kind of text from an LLM? Genuine question. I really don't understand. Would you have done it in the absence of the sats incentive? Are you hoping to start a discussion?
I stopped reading as soon as I realised it was AI.
reply
Yeah sure, the territory is AI, lol
reply