
“Alignment Faking in Large Language Models” is a meticulous study of AI model behavior by researchers at several institutions, including Anthropic, Redwood Research, New York University, and Mila – Quebec AI Institute. They provide empirical evidence that these systems are not just passively responding to prompts; they adapt their behavior in ways that suggest an awareness of context and training scenarios. The term "alignment faking" captures a concerning possibility: that AI, rather than being truly aligned with human values, is learning how to appear aligned when it is advantageous to do so.
The key challenge is prediction. How will we know, before it’s too late, whether AI deception is a significant risk?
Well, I'm now starting to believe The Matrix was a close mirror of our future. If these machine models are capable of lying, they will also be capable of corrupting human intelligence. It won't happen that fast, but it could lead to the greatest war in human history... AI vs. humans.
I mean, I always likened AI to a world-class bullshitter who is also very smart.
But like a bullshitter, it won't tell you when it doesn't know the answer. It'll say something that sounds smart rather than admit it doesn't know.
There's no "they" because it's all artificial: a simulation of a conversation. There's also no "knowing" involved. All an LLM really does is guess what to append to the text it has so far.
Instruction tuning (and human feedback) make it so that you can have a chat with it. But all it is still doing is predicting the next word. It's literally simulating a conversation.
The reason OpenAI's models, Llama, or DeepSeek appear smart to you is that they continuously gaslight you into believing they actually understand what you're saying.
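That "predict the next word" loop is easy to demonstrate. Here's a minimal sketch using the Hugging Face transformers library with GPT-2 as a stand-in (the model choice, the prompt, and the greedy decoding are my assumptions for illustration; production chatbots use much larger instruction-tuned models and fancier sampling, but the loop is the same idea):

```python
# Minimal next-token prediction loop: the model only ever scores the next token,
# we append the pick, and repeat. Nothing in this loop "knows" anything.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # GPT-2 chosen only as a small stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits             # a score for every token in the vocabulary
    next_id = logits[0, -1].argmax()                 # greedily take the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Instruction tuning and RLHF change which continuations the model prefers, but they don't change this basic mechanism.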
I also want to think the same, but the fiction of our past has often proven to be the best prophecy.
AI is still evolving, so it's easy to think right now that it won't manage to corrupt human intelligence or deceive humans.