
The latest results from FrontierMath, a benchmark test for generative AI on novel math problems, show OpenAI's o3 model performed worse than OpenAI previously stated. While newer OpenAI models now outperform o3, the gap highlights the need to scrutinize AI benchmarks carefully.
Epoch AI, the research institute that created and administers the benchmark, released its latest results on April 18.
OpenAI claimed a 25% score on the benchmark in December
In December, the FrontierMath score for OpenAI o3 was part of the almost overwhelming stream of announcements and releases that made up OpenAI's 12-day holiday event. The company claimed OpenAI o3, then its most powerful reasoning model, had solved more than 25% of problems on FrontierMath. In contrast, most rival AI models scored around 2%, according to TechCrunch.
On April 18, Epoch AI released evaluation results showing OpenAI o3 scored closer to 10%. Why is there such a big difference? Both the model and the evaluation could have been different back in December. The version of OpenAI o3 submitted for testing last year was a prerelease version, and FrontierMath itself has changed since December, with a different set of math problems. This isn't necessarily a warning not to trust benchmarks; rather, just remember to dig into the details behind the numbers.
OpenAI o4 and o3-mini score highest on new FrontierMath results
The updated results show OpenAI o4 with reasoning performed best, scoring between 15% and 19%. It was followed by OpenAI o3-mini, then o3. The remaining positions include:
- OpenAI o1
- Grok-3 mini
- Claude 3.7 Sonnet (16K)
- Grok-3
- Claude 3.7 Sonnet (64K)
Although Epoch AI administers the test independently, OpenAI originally commissioned FrontierMath and owns its data.
Criticisms of AI benchmarking
Benchmarks are a popular way to compare generative AI models, but critics say the results can be influenced by test design or a lack of transparency. A July 2024 study raised concerns that benchmarks often overemphasize narrow task accuracy and suffer from non-standardized evaluation practices.