Why AI chatbots hallucinate, according to OpenAI researchers

Language models, despite their remarkable advancements, often produce hallucinations: plausible but false statements delivered with unwarranted confidence. A recent OpenAI research paper, published on September 5, 2025, examines why these errors persist and how current evaluation practices inadvertently exacerbate them. This article summarizes the key findings, covering the mechanisms behind hallucinations and the changes the paper proposes to mitigate them.

Hallucinations occur when models generate incorrect answers to seemingly straightforward questions. For instance, when queried about a person’s birthday or the title of a PhD dissertation, a model might confidently provide multiple incorrect responses. This stems from the way models are trained and evaluated. Unlike spelling or syntax, which follow consistent patterns, factual details like birthdays are often arbitrary and lack predictable structures in training data. This randomness makes it nearly impossible for models to avoid errors entirely, especially for low-frequency facts.
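
To make the low-frequency point concrete, the short sketch below (with invented example facts, not data from the paper) counts how many facts in a toy corpus are mentioned exactly once. A fact the model has seen only once offers no redundancy to learn from, so the larger that fraction, the harder it is to avoid errors on questions about such facts.

```python
from collections import Counter

# Toy illustration with invented example facts (not data from the paper):
# arbitrary facts such as birthdays often appear only once in a training corpus,
# so there is no recurring pattern to learn from, unlike spelling or grammar.
fact_mentions = [
    "Ada Lovelace born 1815-12-10",
    "Ada Lovelace born 1815-12-10",   # mentioned twice: some redundancy to learn from
    "Alan Turing born 1912-06-23",
    "Grace Hopper born 1906-12-09",
    "Claude Shannon born 1916-04-30",
]

counts = Counter(fact_mentions)
singletons = [fact for fact, n in counts.items() if n == 1]
singleton_rate = len(singletons) / len(counts)

print(f"{len(counts)} distinct facts, {len(singletons)} mentioned only once "
      f"(fraction: {singleton_rate:.0%})")
```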

The root of the problem lies in pretraining, where models learn by predicting the next word in vast text corpora. Without explicit “true/false” labels, models cannot easily distinguish valid from invalid statements. They rely on patterns in fluent language, which works well for consistent elements like grammar but falters for specific, unpredictable facts. As a result, hallucinations emerge when models attempt to fill in gaps with plausible guesses rather than admitting uncertainty.
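
A minimal sketch of that objective shows the gap directly. The snippet below (a generic PyTorch illustration with random placeholder tensors, not the paper's code) computes the standard next-token cross-entropy loss: the model is scored only on how well it predicts the text in front of it, and no term anywhere marks a statement as true or false.

```python
import torch
import torch.nn.functional as F

# Illustrative placeholders only: a batch of 2 sequences, 5 tokens each, vocab of 100.
vocab_size = 100
token_ids = torch.randint(0, vocab_size, (2, 5))   # stand-in for tokenized corpus text
logits = torch.randn(2, 5, vocab_size)             # stand-in for a model's output scores

# Standard next-token objective: predict token t+1 from everything up to token t.
# Note what is absent: no "is this statement true?" label appears anywhere; the model
# is rewarded only for matching whatever text the corpus happens to contain.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    token_ids[:, 1:].reshape(-1),            # targets are the following tokens 1..n-1
)
print(loss.item())
```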

Current evaluation methods further aggravate this issue. Most benchmarks prioritize accuracy, rewarding correct answers while ignoring whether a model guesses or abstains when uncertain. This setup mirrors a multiple-choice test where a lucky guess can earn points but admitting “I don’t know” scores zero. For example, on the SimpleQA evaluation, OpenAI’s o4-mini achieves slightly higher accuracy than gpt-5-thinking-mini (24% vs. 22%) but a far higher error rate (75% vs. 26%), because o4-mini guesses on nearly every question while gpt-5-thinking-mini abstains on roughly half of them. Accuracy-only scoring thus rewards lucky guessing over cautious abstention, undermining humility, one of OpenAI’s core values.
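
The guessing incentive is easy to verify with a little expected-value arithmetic. The toy scorer below (hypothetical confidence values, not figures from the paper) shows that under accuracy-only grading, answering beats abstaining at any nonzero confidence, so a score-maximizing model has no reason ever to abstain.

```python
def expected_score_accuracy_only(p_correct: float, abstain: bool) -> float:
    """Expected score on a benchmark that awards 1 point for a correct answer
    and 0 points for both a wrong answer and an abstention."""
    return 0.0 if abstain else p_correct

# Hypothetical confidence levels: even a near-random guess beats abstaining.
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"confidence {p:4.2f}: guess = {expected_score_accuracy_only(p, False):.2f}, "
          f"abstain = {expected_score_accuracy_only(p, True):.2f}")
```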

To address this, the paper proposes rethinking evaluation metrics. Instead of focusing solely on accuracy, scoreboards should penalize confident errors more heavily than expressions of uncertainty. Partial credit for abstaining or acknowledging uncertainty could discourage blind guessing, aligning model behavior with real-world reliability. This approach draws inspiration from standardized tests that use negative marking for wrong answers, a practice that could be adapted to AI evaluations.
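
One way to operationalize this is a scoring rule with an explicit confidence threshold, in the spirit of what the paper suggests (the sketch below is an illustration of that style of rule, not the paper's actual scoring code, and the threshold value is arbitrary): a correct answer earns 1 point, “I don’t know” earns 0, and a wrong answer costs t / (1 - t) points, so answering is only worthwhile when the model’s confidence exceeds t.

```python
def expected_score(p_correct: float, t: float, abstain: bool) -> float:
    """Expected score when a correct answer earns 1 point, abstaining earns 0,
    and a wrong answer is penalized t / (1 - t) points."""
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# With threshold t = 0.75 (an arbitrary choice for illustration), answering only
# has positive expected score when confidence exceeds 0.75; below that, saying
# "I don't know" is the rational move.
t = 0.75
for p in (0.9, 0.75, 0.6, 0.3):
    score = expected_score(p, t, abstain=False)
    print(f"confidence {p:.2f}: answer = {score:+.2f}, abstain = 0.00 -> "
          f"{'answer' if score > 0 else 'abstain'}")
```

Under such a rule the break-even confidence is exactly t, which is the same logic as the negative marking used on some standardized tests.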

The paper also dispels common misconceptions. Hallucinations are not inevitable; models can abstain when uncertain. Nor does avoiding them require ever-larger models: a small model that knows little about a topic can simply say so, while a larger model that knows more must still judge how confident it should be. Most critically, achieving 100% accuracy is unrealistic, as some questions are inherently unanswerable due to missing information or ambiguity. And simply adding hallucination-specific evaluations is insufficient; the primary metrics across all benchmarks must reward calibrated responses.

OpenAI’s latest models, including GPT-5, show reduced hallucination rates, particularly when reasoning, but the challenge persists. By refining evaluation practices and prioritizing uncertainty-aware metrics, the AI community can foster models that balance accuracy with humility, ultimately making them more reliable for real-world applications.
