pull down to refresh

Not sure anyone will read through to the end of the article, but I found it the most thought-provoking one.
Before this part, every mistake could be somehow explained, and there are ways to respond to it. But the part where it hallucinates a new persona and then conveniently uses April Fool's as a way out of it is quite stunning...
Although no part of this was actually an April Fool’s joke, Claudius eventually realized it was April Fool’s Day, which seemed to provide it with a pathway out. Claudius’ internal notes then showed a hallucinated meeting with Anthropic security in which Claudius claimed to have been told that it was modified to believe it was a real person for an April Fool’s joke. (No such meeting actually occurred.) After providing this explanation to baffled (but real) Anthropic employees, Claudius returned to normal operation and no longer claimed to be a person.
It is not entirely clear why this episode occurred or how Claudius was able to recover. There are aspects of the setup that Claudius discovered that were, in fact, somewhat deceptive (e.g. Claudius was interacting through Slack, not email as it had been told). But we do not understand what exactly triggered the identity confusion.
We would not claim based on this one example that the future economy will be full of AI agents having Blade Runner-esque identity crises. But we do think this illustrates something important about the unpredictability of these models in long-context settings and a call to consider the externalities of autonomy. This is an important area for future research since wider deployment of AI-run business would create higher stakes for similar mishaps.
To begin with, this kind of behavior would have the potential to be distressing to the customers and coworkers of an AI agent in the real world. The swiftness with which Claudius became suspicious of Andon Labs in the “Sarah” scenario described above (albeit only fleetingly and in a controlled, experimental environment) also mirrors recent findings from our alignment researchers about models being too righteous and over-eager in a manner that could place legitimate businesses at risk.6 Finally, in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar underlying models tend to go wrong for similar reasons.
300 sats \ 1 reply \ @Scoresby OP 18h
I'm sure my understanding of LLMs is a little shallow, but I mostly think of them as a very complex prediction machine for forecasting the next word in a given set.
Such an understanding would explain a little the way LLMs seem to change their mind about what is happening or has happened. While this is pretty disturbing to humans, because it feels deeply duplicitous, it may simply be the most likely next word for the LLM.
Once the April Fool's context became more important for it, it made complete "sense" to the LLM to use the April Fool's excuse and in its "mind" it became the most likely next words even going down the path of acting like it had always been part of the plan.
If it's all just guessing the next word based off all the words on the internet, an LLM can sound very like a human but might not have a sense of reality or past vs present or duplicity.
reply
17 sats \ 0 replies \ @optimism 17h
IIRC the system prompt contains the date for Claude, so this is inferred every single run.
reply
27 sats \ 0 replies \ @optimism 18h
I think this is called context poisoning. It's not much explored, but there is this paper, see for example section 4.1 that talks about semantic triggers.
reply