I'll admit that I didn't understand this paper and I wasn't willing to put in the effort to figure it out. But, in my book, the authors deserve awesomeness credit just for trying to do this.
Here's how they describe what they're doing in their abstract:
Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain constant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions, regardless of the experimenter's viewpoint on the subject. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.
And I thought this helped a little to explain what they were trying to demonstrate:
Suppose one copies an LLM into AoE II and feeds into the AoE II-LLM ‘I feel lonely’ as an input. This AoE II-LLM replies: ‘I feel bad for you, maybe catch up with a friend? Closeness always helps in these situations’. One would be hard-pressed to make a convincing argument that, because of this response, an AoE II-LLM knows what helps in these situations, whether it truly possesses empathy, or whether its outputs are believable irrespective of its nature as a simulation.
This is because altering the substrate retains some of the LLM’s properties (e.g., their mapping between prompts to outputs up to some tolerance), but not its deanthropomorphic qualities. Then the perception of such properties, and their interpretation, will change. Thus, we say that LLMs are non-unique. Indeed, depending on the substrate it is possible to observe how an LLM processes stimuli in a way that is more relatable to us. In here, the goats move back and forth in a patch of grass, but nothing would have stopped us from using villagers as bits, with faces and voicing human languages. Hence, an AoE II-LLM’s outputs could be less convincing, and the assessment of anthropomorphism is dependent on the representation (and its subsequent interpretation) of such attributes. It then follows that a sufficiently-persuasive argument in favour or against these qualities would require explicit measurements with well-defined, substrate-independent criteria.
I also found the authors' response to this objection helpful:
6.2 Wouldn’t building an LLM on AoE II be the same as a regular LLM, but with extra steps?
Yes. This paper’s construction is meant to illustrate the illusion of anthropomorphic attributes in an LLM. If both an LLM and an AoE II-LLM present the same input/output behaviour but do not present the same interface-related anthropomorphic attributes (e.g., latency or a textual interface), then we can note that a large part of these attributes are ascribed to them based on observer expectations.
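For what it's worth, here is a toy sketch of how I read the "same thing, but with extra steps" point. It is my own illustration, not the paper's construction: the "substrate" here is a single NAND gate (standing in for whatever the goats or villagers would compute), and every function name below is made up for the example. The idea is only that an adder built entirely from that one primitive has the same input/output mapping as Python's native addition, even though it looks nothing like it from the outside.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """The only primitive of the hypothetical substrate: one NAND gate on bits."""
    return 1 - (a & b)


def xor_gate(a: int, b: int) -> int:
    # XOR built from four NAND gates.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def and_gate(a: int, b: int) -> int:
    # AND is NAND followed by a NAND-based NOT.
    n = nand(a, b)
    return nand(n, n)


def or_gate(a: int, b: int) -> int:
    # OR via De Morgan: NAND of the two negated inputs.
    return nand(nand(a, a), nand(b, b))


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits; return (sum_bit, carry_out)."""
    partial = xor_gate(a, b)
    sum_bit = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return sum_bit, carry_out


def add_on_substrate(x: int, y: int, width: int = 8) -> int:
    """Ripple-carry addition of non-negative ints using only NAND gates."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # the final carry is dropped, so this wraps modulo 2**width


def add_native(x: int, y: int, width: int = 8) -> int:
    """The 'regular' version: native addition with the same wrap-around."""
    return (x + y) % (1 << width)


def same_io_behaviour(width: int = 4) -> bool:
    """Exhaustively compare both implementations over all width-bit inputs."""
    limit = 1 << width
    return all(
        add_on_substrate(x, y, width) == add_native(x, y, width)
        for x, y in product(range(limit), repeat=2)
    )
```

Calling `same_io_behaviour()` checks every pair of 4-bit inputs. The two adders agree on all of them, yet one is a single `+` and the other is a pile of gates that is far slower per addition. Whatever differs between them is in how the computation is presented, not in what it maps inputs to, which is roughly the distinction the authors are drawing.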
And this from the conclusion is good:
We also argue that the perceived anthropomorphism of a system varies heavily with respect to its interface. That is to say, a chat window has low latency and high coherence, while an AoE II-LLM is slow and visibly makeshift. Other features, such as embodiment, facial cues, interpretable gestures, could contribute to such perception, even when the underlying entity is the same. We argue that many anthropomorphic measurements in AI are measurements of presentation, rather than of an actual system’s behaviour.
Aaah, throwback <3 loved AoE II