TL;DR:
- OpenAI warns that accuracy-driven metrics push AI models to guess instead of admitting uncertainty, fueling hallucinations.
- New research shows abstaining models produce fewer errors but are penalized in current leaderboards, worsening AI reliability.
- Hallucinations pose risks for industries like healthcare, finance, and law, where mistakes can trigger penalties and erode trust.
- OpenAI’s GPT-5 reduces hallucinations, but experts say evaluation standards must evolve to reward honesty over confident errors.
OpenAI has sounded the alarm over how current evaluation standards in artificial intelligence may be directly contributing to the widespread problem of hallucinations.
In a newly released research paper, the company argues that benchmarks focused almost exclusively on accuracy incentivize large language models (LLMs) to guess answers rather than acknowledge uncertainty, often leading to confidently wrong outputs.
Hallucinations, plausible-sounding but false statements generated by AI systems, remain one of the most pressing challenges in building reliable AI. While LLMs have advanced rapidly in reasoning, coding, and creative writing, OpenAI warns that this progress is being undermined by the very metrics used to measure it.
Research reveals flawed incentives
The study compared two OpenAI models and found starkly different behavior. One, GPT-5-thinking-mini, declined to answer 52% of the time when uncertain and produced an error rate of just 26%. The other, o4-mini, rarely abstained but recorded a staggering 75% error rate.
This mismatch exposes a core flaw in current evaluation frameworks. Much like a multiple-choice exam with no penalty for wrong answers, accuracy-based leaderboards score an abstention the same as a mistake, so guessing is always the better bet. As a result, models that “play it safe” rank lower despite producing fewer falsehoods, while those that guess more aggressively climb the charts.
OpenAI suggests a more nuanced approach: penalize confident but incorrect answers more heavily than abstentions, and grant partial credit when a model admits uncertainty. Such a shift, the company believes, would encourage AI systems to behave more responsibly and align better with human expectations.
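To make the incentive problem concrete, here is a minimal Python sketch built on a few assumptions of our own: the counts reuse the rates quoted above (the guessing model's exact correct/abstain split is filled in for illustration), and the penalty and partial-credit weights are hypothetical, not anything OpenAI has published.

```python
# Illustrative sketch only -- not OpenAI's actual grading rubric.
# It contrasts accuracy-only scoring (abstentions count as failures) with a
# hypothetical uncertainty-aware score that penalizes confident errors and
# gives partial credit for abstaining.

def accuracy_only(correct: int, wrong: int, abstained: int) -> float:
    """Standard leaderboard accuracy: an abstention scores the same as an error."""
    total = correct + wrong + abstained
    return correct / total

def uncertainty_aware(correct: int, wrong: int, abstained: int,
                      wrong_penalty: float = 1.0,
                      abstain_credit: float = 0.25) -> float:
    """Hypothetical alternative: confident errors cost points, honest
    abstentions earn partial credit. The weights are illustrative assumptions."""
    total = correct + wrong + abstained
    return (correct - wrong_penalty * wrong + abstain_credit * abstained) / total

# Outcomes per 100 prompts, based on the rates quoted above. The cautious
# model's 22% correct is implied by 100 - 52 - 26; the guessing model's
# split is an assumed reading of "rarely abstained" with a 75% error rate.
guessing = dict(correct=24, wrong=75, abstained=1)    # o4-mini-like behavior
cautious = dict(correct=22, wrong=26, abstained=52)   # GPT-5-thinking-mini-like

for name, counts in (("guessing", guessing), ("cautious", cautious)):
    print(f"{name}: accuracy={accuracy_only(**counts):.2f}, "
          f"uncertainty-aware={uncertainty_aware(**counts):+.2f}")

# Accuracy-only ranks the guessing model higher (0.24 vs 0.22); the
# uncertainty-aware score flips the ranking (roughly -0.51 vs +0.09).
```

Under accuracy alone the guessing model ranks higher; once a wrong answer costs more than staying silent, the cautious model comes out ahead, which is precisely the behavioral shift OpenAI is arguing evaluations should reward.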
Industry risks and trust challenges
The implications extend far beyond technical leaderboards. In high-stakes industries such as healthcare, finance, and law, hallucinations can translate into serious business and compliance risks.
Legal professionals, for instance, have already faced sanctions for submitting court documents containing fabricated AI-generated citations. In medicine, even a minor factual error could jeopardize patient safety or trigger regulatory penalties.
Enterprises deploying AI at scale are also grappling with reputational fallout. When users encounter hallucinations, especially those delivered with overconfidence, trust erodes quickly. Companies must then invest in costly fact-checking workflows, reducing the efficiency gains that AI was meant to provide in the first place.
GPT-5 and the path forward
The debate over evaluation standards comes at a time when OpenAI is rolling out its latest flagship model, GPT-5, which the company claims significantly reduces hallucinations. During a recent briefing, OpenAI highlighted the model’s ability to admit its limitations rather than fabricate information when tasks cannot be completed.
Sam Altman, CEO of OpenAI, described GPT-5 as “Ph.D. level” compared to earlier generations, citing advances in reasoning, coding, and health-related queries. Still, the new paper underscores that technical improvements alone will not solve the problem if flawed evaluation methods continue to incentivize bad behavior.
As AI adoption accelerates across regulated industries and consumer applications alike, OpenAI’s warning may force researchers, companies, and regulators to rethink how they measure progress. The shift from accuracy-only benchmarks toward uncertainty-aware evaluation could mark a turning point in making AI systems not only smarter, but also more trustworthy.