Thinking AI models collapse in face of complex problems, Apple researchers find
These findings, ahead of WWDC 2025, from Apple’s own researchers, may suggest that any AI reasoning announcements will likely be pragmatic and focused on specific use-cases
Just days ahead of the much-anticipated Worldwide Developer Conference (WWDC), Apple has released a study titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which saw researchers testing ‘reasoning’; AI models such as Anthropic’s Claude, OpenAI’s o models, DeepSeek R1 and Google’s Thinking models to see how far they can scale to replicate human reasoning. Spoiler alert — not as much, as the entire AI marketing pitch, would have you believe. Could this signal what may be in store for Apple’s AI conversation ahead of the keynote?

The study questions the current standard evaluation of Large Reasoning Models (LRMs) using established mathematical and coding benchmarks, arguing they suffer from data contamination and don’t reveal insights into reasoning trace structure and quality. Instead, it proposes a controlled experimental testbed using algorithmic puzzle environments. The limitations of AI benchmarking, and need to evolve, is something we had written about earlier.
“We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,” the researcher paper points out. These findings are a stark warning to the industry — current LLMs are far from general-purpose reasoners.
The emergence of Large Reasoning Models (LRMs), such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, has been hailed as a significant advancement, potentially marking steps toward more general artificial intelligence. These models characteristically generate responses following detailed “thinking processes”, such as a long Chain-of-Thought sequence, before providing a final answer. While they have shown promising results on various reasoning benchmarks, the capability of benchmarks to judge rapidly evolving models, itself is in doubt.
The researchers cite a comparison between non-thinking LLMs and their ‘thinking’ evolution. “At low complexity, non-thinking models are more accurate and token-efficient. As complexity increases, reasoning models outperform but require more tokens—until both collapse beyond a critical threshold, with shorter traces,” they say. The illustrative example of the Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking illustrates how both models retain accuracy till complexity level three, after which the standard LLM sees a significant drop, something the thinking model too suffers from, a couple of levels later. At the same time, the thinking model is using significantly more tokens.
This research attempted to challenge prevailing evaluation paradigms, which often rely on established mathematical and coding benchmarks, which are otherwise susceptible to data contamination. Such benchmarks also primarily focus on final answer accuracy, providing limited insight into the reasoning process itself, something that is the key differentiator for a ‘thinking’ model compared with a simpler large language model. To address these gaps, the study utilises controllable puzzle environments — Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World — and these puzzles allow for precise manipulation of problem complexity while maintaining consistent logical structures and rules that must be explicitly followed. That structure theoretically opens a window, a glance at how these models attempt to “think”.
The findings from this controlled experimental setup reveal significant limitations in current frontier LRMs. One of the most striking observations is the complete accuracy collapse that occurs beyond certain complexity thresholds across all tested reasoning models. This is not a gradual degradation but a sharp drop to near-zero accuracy as problems become sufficiently difficult.
“The state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,” note the researchers.
These results inevitably challenge any notion that the LRMs truly possess generalisation problem-solving skills, required for planning tasks or multi-step processes. The study also identifies a counter-intuitive scaling limit in the models’ reasoning effort (this is measured by the inference token usage during the “thinking” phase), which sees these models initially spend more tokens, but as complexity increases, they actually reduce reasoning effort closer to the inevitable accuracy collapse.
Researchers say that “despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood. Critical questions still persist: Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?,” they ask. There are further questions pertaining to performance scaling with increasing problem complexity, comparisons to the non-thinking standard LLM counterparts when provided with the same inference token compute, and around inherent limitations of current reasoning approaches, as well as improvements that might be necessary to advance toward more robust reasoning.
Where do we go from here?
The researchers make it clear that their test methodology too has limitations. “While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge intensive reasoning problems,” they say. They do add that the use of “deterministic puzzle simulators assumes that reasoning can be perfectly validated” at every step, a validation that may not be feasible to such precision in less structured domains. That they say, would restrict validity of analysis to more reasoning.
There is little argument that LRMs represent progress, particularly for the relevance of AI. Yet, this study highlights that not all reasoning models are capable of robust, generalisable reasoning, particularly in the face of increasing complexity. These findings, ahead of WWDC 2025, and from Apple’s own researchers, may suggest that any AI reasoning announcements will likely be pragmatic. The focus areas could include specific use cases where current AI methodology is reliable (the research paper indicates lower to medium complexity, less reliance on flawless long-sequence execution) and potentially integrating neural models with traditional computing approaches to handle the complexities where LRMs currently fail. The era of Large Reasoning Models is here, but this “Illusion of thinking” study is that AI with true reasoning, remains a mirage.