The Illusion of Thinking: Shojaee et al. (2025) claim that Large Reasoning Models (LRMs) suffer from an “accuracy collapse” when solving planning puzzles beyond certain complexity thresholds.

“The Illusion of Thinking: A Comment on Shojaee et al. (2025)” critically examines a recent study that claimed Large Reasoning Models (LRMs) suffer from an “accuracy collapse” when solving planning puzzles beyond certain complexity thresholds. The authors of this response argue that these findings are not indicative of fundamental reasoning limitations in AI models but rather stem from flaws in experimental design and evaluation methodology.

One key issue identified is the Tower of Hanoi benchmark used by Shojaee et al., where the required output length exceeds model token limits at higher complexity levels. The models often explicitly acknowledge that they cannot list every step within practical output constraints, yet they still articulate the underlying solution pattern. This behavior was misinterpreted as a reasoning failure rather than a deliberate decision to truncate output. The automated evaluation pipeline did not distinguish genuine reasoning failures from output-length limitations, leading to incorrect conclusions about model capabilities.

A second critical flaw arises in the River Crossing puzzle experiments. Some instances presented were mathematically unsolvable due to insufficient boat capacity, yet models were penalized for failing to produce a solution. This reflects a deeper problem with programmatic evaluations—scoring models based on impossible tasks can lead to misleading assessments of their abilities.
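This point about solvability is easy to operationalize: the reachable state space of these puzzles is small enough to search exhaustively before any model is scored. The sketch below is a hypothetical pre-check, not code from either paper; it uses a count-based missionaries-and-cannibals abstraction of the actor/agent puzzle (state = how many actors and agents remain on the left bank, plus the boat position) and reports whether any legal sequence of crossings exists. Consistent with the classical result, five pairs can cross with a three-person boat, but six cannot.

```python
from collections import deque

def river_crossing_solvable(n_pairs: int, boat_capacity: int) -> bool:
    """Breadth-first search over crossing states.

    Count-based abstraction: a bank is safe when it has no actors, or when its
    actors are not outnumbered by agents. State = (actors on left, agents on
    left, boat on left?). Returns True if everyone can reach the right bank.
    """
    def bank_safe(actors: int, agents: int) -> bool:
        return actors == 0 or actors >= agents

    start = (n_pairs, n_pairs, True)
    goal = (0, 0, False)
    seen = {start}
    frontier = deque([start])
    while frontier:
        actors, agents, boat_left = frontier.popleft()
        if (actors, agents, boat_left) == goal:
            return True
        sign = -1 if boat_left else 1  # passengers leave the bank the boat is on
        for da in range(boat_capacity + 1):
            for dg in range(boat_capacity + 1 - da):
                if da + dg == 0:
                    continue  # the boat never crosses empty
                na, ng = actors + sign * da, agents + sign * dg
                if not (0 <= na <= n_pairs and 0 <= ng <= n_pairs):
                    continue  # tried to move more people than the bank holds
                if bank_safe(na, ng) and bank_safe(n_pairs - na, n_pairs - ng):
                    state = (na, ng, not boat_left)
                    if state not in seen:
                        seen.add(state)
                        frontier.append(state)
    return False

print(river_crossing_solvable(5, 3))  # True: five pairs fit the classical capacity-3 bound
print(river_crossing_solvable(6, 3))  # False: scoring a model on this instance is meaningless
```

Running a check of this kind before evaluation would flag impossible instances instead of counting them as model failures.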

Additionally, the paper highlights how models’ output-token limits strongly shape apparent performance ceilings. For Tower of Hanoi, the number of moves, and therefore the number of tokens needed to enumerate them all, grows exponentially with the number of disks: an N-disk instance requires 2^N − 1 moves. Once this budget is exhausted, models appear to “collapse” in accuracy, not because they lack reasoning ability, but because they cannot emit longer outputs.
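The arithmetic behind this ceiling is straightforward. Assuming, purely for illustration, a roughly constant token cost per printed move, the sketch below finds the smallest disk count at which a full move-by-move transcript no longer fits in a given output budget; the per-move estimate and the budgets are assumptions, not figures measured in either paper.

```python
# Rough growth model: ~5 tokens per printed move is an illustrative assumption,
# not a measured value from either paper.
TOKENS_PER_MOVE = 5

def transcript_tokens(n_disks: int) -> int:
    """Tokens needed to enumerate all 2^n - 1 moves of an n-disk Tower of Hanoi."""
    return TOKENS_PER_MOVE * (2 ** n_disks - 1)

def first_unprintable_n(budget: int) -> int:
    """Smallest disk count whose full move list exceeds the output budget."""
    n = 1
    while transcript_tokens(n) <= budget:
        n += 1
    return n

for budget in (64_000, 128_000):  # hypothetical output budgets
    n = first_unprintable_n(budget)
    print(f"{budget:>7}-token budget: enumeration breaks at N = {n} "
          f"({transcript_tokens(n):,} tokens needed)")
```

Because the requirement doubles with every added disk, doubling the output budget shifts the apparent “collapse” point by only a single disk.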

To test whether this was truly a reasoning limitation, the authors conducted preliminary experiments with an alternative representation: asking models to generate a Lua function that solves the Tower of Hanoi puzzle instead of listing every move. Under this format, multiple models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, and Google Gemini 2.5) demonstrated high accuracy on instances that the original study had reported as complete failures, using fewer than 5,000 tokens.
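This works because the generating procedure is tiny even when its output is not. The comment asked models for a Lua function; the sketch below expresses the same standard recursive construction in Python for illustration (the function name and interface are mine, not the paper’s).

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the complete move list for an n-disk Tower of Hanoi (2^n - 1 moves)."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [(source, target)]                   # move the largest disk to the target peg
        + hanoi(n - 1, spare, target, source)  # stack the smaller disks back on top of it
    )

moves = hanoi(15)
print(len(moves))   # 32767 moves: far more than a model can print, yet trivially short as code
print(moves[:3])    # [('A', 'C'), ('A', 'B'), ('C', 'B')]
```

Emitting such a function demonstrates the same algorithmic understanding as listing the moves, at a cost of a few hundred tokens rather than tens of thousands.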

The paper also critiques the use of solution length as a complexity metric, arguing that it conflates mechanical execution with genuine problem-solving difficulty. Tower of Hanoi requires exponentially many moves, but each step follows a trivial rule; River Crossing involves tightly coupled constraint satisfaction even when the move count is small, making it the more cognitively demanding task.

In conclusion, the authors assert that Shojaee et al.’s results reflect engineering and evaluation artifacts rather than intrinsic reasoning failures in LRMs. They call for future research to:

  1. Distinguish clearly between reasoning capability and output constraints.
  2. Ensure puzzle solvability before evaluating model performance.
  3. Use complexity metrics that align with computational difficulty, not just solution length.
  4. Explore diverse solution representations to better assess algorithmic understanding.

Ultimately, the paper challenges the narrative that current models lack deep reasoning abilities, emphasizing that the real challenge may lie in designing evaluations that accurately measure what models truly understand.