Fig. 7a shows the relationship between the position of intermediate solutions within thoughts, their
correctness, and problem complexity across all puzzle environments.
As problems become moderately more complex,
this trend reverses: models first explore incorrect solutions and arrive at the correct ones mostly
later in the thought. This time the distribution of incorrect solutions (red) is shifted downward,
toward earlier positions in the thought, compared to correct ones (green).
compared to correct ones (green). Finally, for the problems with higher complexity, collapse emerges,

Figure 7: Left & Middle: Position and correctness of intermediate solutions within reasoning traces
across four puzzles at varying complexity levels. ✓ indicates correct solutions, ✗ indicates incorrect
solutions, with distribution density shown by shading; Right: Solution accuracy versus position
in thinking for Tower of Hanoi at different complexity levels. Simple problems (N=1-3) show early
accuracy declining over time (overthinking), moderate problems (N=4-7) show slight improvement
in accuracy with continued reasoning, and complex problems (N≥8) exhibit consistently near-zero
accuracy, indicating complete reasoning failure.

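As an illustration of the kind of measurement summarized in the right panel, the following minimal sketch bins intermediate solutions by their relative position in the thought and computes accuracy per bin. It is not the authors' analysis code: the function name `accuracy_by_position`, the `(position, is_correct)` record format, the bin count, and the toy data are all assumptions made for this example.

```python
def accuracy_by_position(records, n_bins=4):
    """Return per-bin accuracy of intermediate solutions.

    records: iterable of (position, is_correct) pairs, where position is the
             relative location of a solution within the thought, in [0, 1].
    Returns a list of length n_bins; empty bins are reported as None.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for position, is_correct in records:
        # Map position to a bin index; position == 1.0 goes into the last bin.
        b = min(int(position * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(is_correct)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical toy traces (not data from the paper).
simple = [(0.1, True), (0.2, True), (0.6, False), (0.9, False)]
moderate = [(0.1, False), (0.3, False), (0.7, True), (0.95, False)]

print(accuracy_by_position(simple, n_bins=2))    # [1.0, 0.0]
print(accuracy_by_position(moderate, n_bins=2))  # [0.0, 0.5]
```

On this toy input, the simple case has accuracy falling from the first half of the thought to the second (the overthinking pattern), while the moderate case has accuracy rising later in the thought.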
Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns,
from inefficient “overthinking” on simpler problems to complete failure
on complex ones. These insights challenge prevailing assumptions about LRM capabilities and
suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.