    Are ‘Reasoning’ Models Really Smarter Than Other LLMs? Apple Says No

    June 11, 2025 | Updated: June 11, 2025 | Tech

    According to a paper from Apple researchers, generative AI models with “reasoning” do not actually solve certain problems any better than standard LLMs.

    Even generative AI’s creators are not entirely sure how it works. Sometimes they point to that mystery as an achievement, as proof that they are conducting research beyond the scope of human comprehension. The Apple team tried to unravel some of the mystery by examining the internal “reasoning traces” that drive how these models operate.

    The researchers focused on reasoning models such as OpenAI’s o3 and Anthropic’s Claude 3.7 Sonnet Thinking, which generate a chain of thought and critique their own reasoning before producing an answer.
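
    To make the distinction concrete, here is a minimal sketch (not taken from the Apple paper) of the two prompting styles being contrasted. The query_model helper is a hypothetical placeholder for whatever API a given provider exposes; reasoning models such as o3 or Claude 3.7 Sonnet Thinking effectively build the second style into the model itself.

```python
# Minimal sketch of the two styles being compared; query_model() is a
# hypothetical placeholder for an actual provider API call.

PUZZLE = ("Move 3 disks from peg A to peg C, one at a time, "
          "never placing a larger disk on a smaller one.")

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this up to a real model API")

# A standard LLM is asked for the answer directly.
direct_prompt = f"{PUZZLE}\nList the moves."

# A reasoning model first writes out a chain of thought, checks it,
# and only then commits to a final answer.
reasoning_prompt = (
    f"{PUZZLE}\n"
    "Think step by step, write out your reasoning, "
    "double-check each move against the rules, "
    "then give the final move list after the line 'ANSWER:'."
)

# direct_answer   = query_model(direct_prompt)
# reasoned_answer = query_model(reasoning_prompt)
```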

    According to their findings, these models struggle with highly complex problems: their accuracy eventually collapses entirely, often falling below that of simpler models.

    In some evaluations, conventional models outperform reasoning models.

    According to the paper, standard models outperform reasoning models on low-complexity tasks, while reasoning models do better on medium-complexity tasks. On the most challenging tasks, neither kind of model could succeed.

    Those tasks were puzzles, chosen as benchmarks because the group wanted to avoid leakage from training data and to create controlled test conditions, the researchers wrote.

    SEE: Qualcomm intends to buy UK firm Alphawave for $2.4 billion to expand into the data center and AI market.

    Instead, Apple tested reasoning models on puzzles such as the Tower of Hanoi, which involves moving disks of various sizes across three pegs. On the simpler puzzles, reasoning models actually had lower accuracy than standard large language models.

    On medium-difficulty puzzles, reasoning models performed significantly better than conventional LLMs. But on the more challenging versions (eight disks or more), reasoning models could not solve the puzzle at all, even when an algorithm was provided to them. They tended to “overthink” the simpler puzzles and could not extrapolate far enough to solve the more difficult ones.
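
    For a sense of why this puzzle gets hard so quickly, the standard recursive algorithm is sketched below (a generic illustration, not Apple’s actual test harness): an n-disk Tower of Hanoi requires 2^n - 1 moves, so the eight-disk version already demands a flawless 255-move solution.

```python
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move list for n disks: 2**n - 1 moves."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)      # park n-1 disks on the spare peg
        + [(src, dst)]                   # move the largest disk
        + hanoi(n - 1, aux, src, dst)    # stack the n-1 disks back on top
    )

for disks in (3, 5, 8, 10):
    print(f"{disks} disks -> {len(hanoi(disks))} moves")  # 7, 31, 255, 1023
```

    That exponential growth is why even a model that is handed the procedure still has to execute hundreds of steps without a single slip.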

    To compare models built on the same underlying architecture, they tested Anthropic’s Claude 3.7 Sonnet with and without reasoning, as well as DeepSeek R1 versus DeepSeek V3.
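
    A rough sketch of how such a paired comparison could be tallied is shown below; the run_model placeholder, the complexity buckets, and the “thinking”/“non-thinking” labels are illustrative assumptions, not the paper’s actual evaluation code.

```python
from collections import defaultdict

def run_model(variant: str, puzzle: str) -> tuple[bool, int]:
    """Placeholder: return (solved_correctly, tokens_used) for one attempt."""
    raise NotImplementedError("wire this up to the model pair being compared")

def compare(puzzles_by_complexity: dict[str, list[str]]) -> dict:
    """Tally accuracy and average token usage per (variant, complexity bucket)."""
    tally = defaultdict(lambda: {"attempts": 0, "solved": 0, "tokens": 0})
    for bucket, puzzles in puzzles_by_complexity.items():
        for puzzle in puzzles:
            for variant in ("non-thinking", "thinking"):
                solved, tokens = run_model(variant, puzzle)
                cell = tally[(variant, bucket)]
                cell["attempts"] += 1
                cell["solved"] += int(solved)
                cell["tokens"] += tokens
    return {
        key: {
            "accuracy": cell["solved"] / cell["attempts"],
            "avg_tokens": cell["tokens"] / cell["attempts"],
        }
        for key, cell in tally.items()
    }
```

    Grouping results this way is what lets the researchers talk about separate low-, medium-, and high-complexity regimes, as described above.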

    Reasoning models have a tendency to “overthink.”

    This inability to solve certain puzzles suggests that reasoning models do not work as efficiently as expected.

    At low complexity, non-thinking models are more accurate and token-efficient. As complexity rises, reasoning models outperform but require more tokens, until both collapse beyond a critical threshold, with shorter reasoning traces, according to the researchers.

    Reasoning models may “overthink,” spending time pursuing wrong ideas even after they have already discovered the right solution.

    According to the authors, LRMs have limited self-correction abilities, revealing fundamental inefficiencies and clear scaling limitations.

    Additionally, the researchers noted that performance on tasks like the River Crossing puzzle may have been hindered by a lack of comparable examples in the models’ training data, which limited their ability to generalize or reason through novel variations.

    Is the development of generative AI hitting a plateau?

    In a related report on the limitations of large language models at math, published by Apple researchers in 2024, they suggested that existing AI math benchmarks were inadequate.

    There are hints across the industry that generative AI advances may be reaching their limits. Upcoming releases might focus on incremental changes rather than significant ones. Depending on your use case, OpenAI’s GPT-5 may combine existing models into a more usable interface, but it might not be a significant upgrade.

    Apple has been slow to roll out generative AI features in its products, despite holding its Worldwide Developers Conference this year.

    Source credit
