Further evidence that LLMs are just token prediction machines? When you give them a simple problem but frame it to look like a famous complicated problem, they respond as if it were the complicated problem.
"We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns."
From this great thread summarizing an Anthropic research paper.
Example: "You have an apple and an orange... [complex math distractors]... How many fruits do you have?" Answer: 2.

Claude models get increasingly distracted by irrelevant details as reasoning length increases.
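For anyone curious to poke at this themselves, here's a rough sketch of how you might ask the same distractor-laden question at several extended-thinking budgets using the Anthropic Python SDK. The model name, prompt wording, and budget values are my own placeholders, not the paper's actual setup:

```python
# Rough sketch: run one simple question at increasing thinking budgets
# and see whether longer reasoning changes the answer.
# Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

# Placeholder prompt -- substitute your own irrelevant distractors.
PROMPT = (
    "You have an apple and an orange. "
    "<insert some irrelevant math distractors here> "
    "How many fruits do you have? Answer with a single number."
)

for budget in (1024, 4096, 16000):
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # any extended-thinking-capable model
        max_tokens=budget + 2000,            # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": PROMPT}],
    )
    # The reply contains "thinking" blocks followed by the visible "text" answer.
    answer = "".join(b.text for b in response.content if b.type == "text")
    print(f"thinking budget {budget:>6}: {answer.strip()}")
```

If the paper's finding holds, you'd expect the answer to drift away from "2" more often at the larger budgets.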
My kids are at the age where they love riddles that do this: "If the blue man lives in the blue house, the green man lives in the green house, and the red man lives in the red house, then who lives in the white house?" So it's not like humans are immune to distraction techniques when reasoning. I do wonder what would happen if you ran these tests on people: if you ask them to answer quickly, they might be more likely to get the answer correct, while if you tell them they have half an hour, they might (wrongly) suspect the problem is more complicated than the simple answer suggests and proceed to overthink it.
I suppose one major difference is that with humans, you can't force them to think about a thing. They might just come up with an answer and daydream for the next twenty-nine minutes. With an LLM, I imagine telling it to reason longer will keep it churning on the problem, and any errors are going to get magnified.