Importance Score: 65 / 100 🔴
OpenAI’s o3 AI Model Benchmark Discrepancy Sparks Transparency Questions
A gap between OpenAI's first-party benchmark results for its o3 AI model and independent evaluations is raising questions about the company's transparency and model testing practices.
Initial Performance Claims and Independent Verification
In December, when OpenAI introduced o3, the company claimed the model could solve just over a quarter of the problems in FrontierMath, a notoriously difficult set of mathematical questions. That result far outpaced the competition: the next best model answered only around 2% of FrontierMath problems correctly.
During a live online broadcast, Mark Chen, OpenAI's chief research officer, stated, “Currently, all available solutions score below 2% [on FrontierMath]. We are observing [internally], with o3 under intensive computational testing conditions, that we can exceed 25%.”
However, it appears this figure represented a peak performance level, attained by a version of o3 utilizing greater computational resources than the iteration OpenAI publicly launched the previous week.
Independent Benchmarking by Epoch AI
Epoch AI, the research organization responsible for FrontierMath, published the results of its independent benchmark evaluations of o3 on Friday. Epoch found that o3 scored approximately 10%, substantially below OpenAI's initially publicized top score.

OpenAI has unveiled o3, their highly anticipated reasoning model, alongside o4-mini, a more compact and economical model succeeding o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread!
— Epoch AI (@EpochAIResearch) April 18, 2025
This discrepancy does not necessarily imply intentional misrepresentation by OpenAI. The benchmark results the company shared in December included a lower-bound score that matches the score Epoch observed. Epoch also acknowledged that its testing setup likely differed from OpenAI's and that it used a more recent version of FrontierMath in its assessments.
Epoch stated, “The disparity between our results and OpenAI’s could stem from OpenAI employing a more robust internal framework, utilizing greater test-time computational power, or due to evaluations conducted on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 versus the 290 problems in frontiermath-2025-02-28-private).”
Confirmation from ARC Prize Foundation
Echoing Epoch’s findings, a post on social media platform X from the ARC Prize Foundation, another entity that assessed a pre-release version of o3, indicated that the publicly accessible o3 model “is a different model […] optimized for chat/product application.”
ARC Prize further noted, “All released o3 compute tiers are smaller than the version we [benchmarked].” Generally, it is anticipated that larger compute tiers correlate with improved benchmark performance.
Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:
o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task
Above uses o1 pro pricing…
— Mike Knoop (@mikeknoop) April 16, 2025
OpenAI Acknowledges Optimization for Practical Use
Wenda Zhou, a technical staff member at OpenAI, acknowledged in a recent online presentation that the production version of o3 is “more optimized for real-world use scenarios” and speed, in contrast to the o3 version showcased in December. He suggested this optimization might lead to benchmark “discrepancies.”
Zhou elaborated, “[W]e’ve implemented [optimizations] to enhance the [model’s] cost-effectiveness [and] overall utility.” He further stated, “We still anticipate – and believe – that this remains a superior model […]. Users will experience reduced wait times when seeking responses, a significant consideration with these [model] types.”
Benchmark Context and Industry Trends
Ultimately, the fact that the public release of o3 falls short of OpenAI's initial testing claims may be somewhat beside the point, given that OpenAI's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and the company intends to launch a more powerful o3 variant, o3-pro, in the near future.
Nonetheless, it serves as another reminder to interpret AI benchmarks with caution, especially when they come from companies promoting their own products.
Benchmark “controversies” are becoming increasingly prevalent within the AI sector as companies vie for media attention and market prominence with their latest models.
Prior Benchmark-Related Incidents in the AI Field
Earlier this year, Epoch faced criticism for delaying the disclosure of funding received from OpenAI until after the o3 announcement. Many academics who contributed to FrontierMath were unaware of OpenAI's involvement until it was publicly revealed.
More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to promoting benchmark scores for a model version different from the one made available to developers.