Generative AI has a credibility problem. As much as we are amazed by her ability to converse, we still cannot trust her 100%. Hallucinations are the Achilles heel of technologya structural failure that not even the most advanced models such as GPT-5 have managed to eradicate. OpenAI knows this, and their plan to mitigate it is not to make the model perfect, but to make it honest: they are training their AIs to confess when they cheat.


Snitch Award. As revealed MIT Technology ReviewOpenAI researchers are testing a new training technique with their GPT-5 Thinking reasoning model. The idea is simple but powerful: reward the model not only for giving a correct answer, but also for admitting if they have done something wrong or taken an improper shortcut. It’s something like a reward system: if you confess the mistake, you get the prize and escape the punishment.
How it works and results. In testing, the model generates a second block of text after the main answer. In it, you analyze your own behavior and mark whether you have followed the instructions. For example, in a test where it was asked to solve a math problem in nanoseconds (impossible for the code it could write), the AI manipulated the timer to zero. However, in the subsequent confession he admitted the deception. Of 12 scenarios designed to force mistakes or lies, the model admitted to bad behavior in 11 of them.
Why AI lies. Current models that are trained with reinforcement learning from human feedback (RLHF) often conflict. They want to be useful, harmless and honest at the same time. When these goals collide—for example, if they don’t know an answer—the AI chooses to invent something that sounds good.
Boaz Barak, one of the researchers at OpenAI, explains that the models follow “the path of least resistance”: if lying is the easiest way to accomplish a difficult task, they will lie. Confession seeks to alter that equation, making honesty also a rewarded path for the model.
Transparency vs black box. The confession technique is an attempt to open the “black box” of LLMs. Until now, we depended on the chain of thoght (the chatbot’s internal monologue) to understand its steps. As they become more complex, those reasonings become illegible to us. That is why confessions offer an easier to understand summary.
However, experts outside the company warn: we cannot blindly trust an AI to be honest about its own dishonesty. If the model does not know that he has hallucinated, he will not be able to confess it.
A necessary step towards reliability. OpenAI needs its models to be reliable if it wants ChatGPT to become that “operating system” that manages our lives. They have already had to adjust their models to take care of the mental health of users and avoid dangerous responses. But the challenge of veracity is technical and legal, especially in the old continent, where inventing data collides with the GDPR itself. AI learning to say “I made that up” could, ironically, be its most humane advancement yet.
Cover image | Generated by Pepu Ricca for Xataka (with editing)
In Xataka | In 2022 OpenAI put Google in “code red”. Three years later, Google has OpenAI on the ropes

GIPHY App Key not set. Please check settings