Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia: they struggle to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions.