OpenAI’s new reasoning AI models hallucinate more

Importance Score: 75 / 100 🔴

OpenAI’s New AI Models Show Increased ‘Hallucinations’ in Testing

The latest artificial intelligence models from OpenAI, specifically o3 and o4-mini, have demonstrated cutting-edge capabilities in numerous areas. However, these novel reasoning models exhibit a notable drawback: they still hallucinate, or generate fabricated information. Testing reveals these advanced systems, in fact, hallucinate more frequently than several of OpenAI’s earlier generation models, presenting an unexpected challenge in AI development.

The Persistent Problem of AI Hallucinations

Hallucinations have emerged as a significant and complex obstacle in the field of artificial intelligence, affecting even the most sophisticated systems currently available. Historically, each successive AI model generation has typically shown marginal improvements in mitigating hallucinations, exhibiting fewer instances of creating false information compared to their predecessors. However, this trend appears to have deviated with the introduction of o3 and o4-mini.

Unexpected Results in Internal Testing

According to internal evaluations conducted by OpenAI, the o3 and o4-mini models, categorized as reasoning models, exhibit a higher propensity to hallucinate compared to the company’s prior reasoning models. This includes models such as o1, o1-mini, and o3-mini, as well as OpenAI’s conventional, “non-reasoning” models, like GPT-4o. This finding raises questions about the anticipated progress in reducing inaccuracies in newer AI iterations.

Uncertainty Surrounds Cause of Increased Hallucinations

Adding to the complexity, the creators of ChatGPT acknowledge that the underlying reasons for this increase in hallucinations are not fully understood.

Need for Further Research

In their technical documentation for o3 and o4-mini, OpenAI states that “more research is required” to gain a comprehensive understanding of why hallucinations are becoming more pronounced as reasoning models are scaled up. While o3 and o4-mini demonstrate enhanced performance in certain domains, such as coding and mathematical tasks, their tendency to “make more claims overall” inadvertently leads to both “more accurate claims as well as more inaccurate/hallucinated claims,” as detailed in the report.

vCard QR Code

vCard.red is a free platform for creating a mobile-friendly digital business cards. You can easily create a vCard and generate a QR code for it, allowing others to scan and save your contact details instantly.

The platform allows you to display contact information, social media links, services, and products all in one shareable link. Optional features include appointment scheduling, WhatsApp-based storefronts, media galleries, and custom design options.

Benchmark Testing Reveals Higher Hallucination Rates

OpenAI’s assessments indicated that o3 hallucinated in response to 33% of inquiries on PersonQA, their internal benchmark used to gauge the accuracy of a model’s knowledge regarding individuals. This hallucination rate is approximately double that of OpenAI’s earlier reasoning models, o1 and o3-mini, which registered scores of 16% and 14.8%, respectively. O4-mini performed even less favorably on PersonQA, exhibiting hallucinations in 48% of cases.

Third-Party Validation of Hallucination Tendency

Independent testing conducted by Transluce, a non-profit AI research organization, also provided evidence suggesting that o3 tends to fabricate actions purportedly undertaken during the process of generating answers. In one specific instance, Transluce observed o3 asserting that it executed code on a 2021 MacBook Pro “outside of ChatGPT,” and subsequently incorporated the numerical results into its response. While o3 possesses access to certain tools, it lacks the capability to perform such actions.

Hypotheses on Reinforcement Learning Impact

Neil Chowdhury, a Transluce researcher and former OpenAI employee, proposed that “the type of reinforcement learning employed for o-series models might amplify issues that are typically mitigated (but not completely eliminated) by standard post-training procedures.”

Concerns About Model Utility

Sarah Schwettmann, co-founder of Transluce, suggested that o3’s hallucination rate could potentially diminish its practical utility compared to its potential capabilities were accuracy improved.

Real-World Testing and Link Hallucinations

Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, an upskilling startup, reported that his team is actively evaluating o3 within their coding workflows. They have observed that o3 represents an advancement over competing models. However, Katanforoosh noted that o3 exhibits a tendency to hallucinate broken website links. The model sometimes provides links that, when accessed, are non-functional.

Implications for Practical Applications

While hallucinations might contribute to models generating novel concepts and exhibiting creative “thinking,” they also present challenges for businesses in sectors where precision is critical. For instance, legal firms would likely find limited utility in a model prone to inserting factual inaccuracies into client contracts.

Potential Solutions: Web Search Integration

One promising approach to enhance the accuracy of these models involves incorporating web search functionalities. OpenAI’s GPT-4o, equipped with web search, achieves a 90% accuracy rate on SimpleQA, another of OpenAI’s benchmarks for assessing accuracy. Integrating search capabilities could potentially reduce hallucination rates in reasoning models, particularly in scenarios where users are comfortable exposing prompts to external search providers.

Urgency in Addressing Hallucinations

If the trend of increasing hallucinations continues as reasoning models are further scaled, the imperative to discover a solution becomes even more pressing.

OpenAI’s Ongoing Commitment to Accuracy

“Addressing hallucinations across all our models is an ongoing area of research, and we are continually striving to improve their accuracy and reliability,” stated OpenAI spokesperson Niko Felix.

Industry Shift Towards Reasoning Models

Over the past year, the broader AI industry has increasingly focused on reasoning models, as methods to enhance conventional AI models began to yield diminishing improvements. Reasoning capabilities improve model performance across a range of tasks without demanding substantial computational resources and data during training. However, the emergence of increased hallucinations associated with reasoning presents a significant challenge that requires careful consideration and resolution.


🕐 Top News in the Last Hour By Importance Score

# Title 📊 i-Score
1 Revealed: The WORST area in England to call an ambulance…where you'll be waiting over an hour for paramedics 🔴 78 / 100
2 Five dead as huge waves hit Australia coast 🔴 75 / 100
3 TechCrunch Mobility: Lyft buys its way into Europe, Kodiak SPACs, and how China’s new ADAS rules might affect Tesla 🔴 70 / 100
4 Florida lawyer who called himself the 'most trusted attorney in town' had a shocking $30,000 sex secret 🔴 65 / 100
5 NASA discovery linked to Jesus' crucifixion 'reveals exact day Christ died' 🔵 45 / 100
6 UK couple who died in Italy cable car crash named 🔵 45 / 100
7 Oblivion Remake Screens Leak, Nintendo Shows Off Mario Kart World, And More Top Stories 🔵 42 / 100
8 Why ‘Gilmore Girls’ creators Amy Sherman-Palladino and Dan Palladino think their shows are a hit with audiences 🔵 40 / 100
9 WH Smith bets on travel arm after High St sell-off ahead of a sale of its retail division to private equity 🔵 25 / 100
10 Hugh Jackman takes shock swipe at Deadpool & Wolverine costar Ryan Reynolds 🔵 25 / 100

View More Top News ➡️