O3 sabote your own off to continue working

There is Dave, trying to return to the main nave. Open the communication with Hal 9000 and orders you to open the access gate. “Sorry, Dave. I’m afraid I can’t do that“The computer tells him, refusing.

It is the climax of ‘2001: an odyssey of space‘, Stanley Kubrick’s movie that premiered in 1968 and that He showed us a particular dystopia: The capacity of self -preservation of the IAS and how this capacity made the machines stop following the human orders.

That vision raised a clear threat in the development of artificial intelligence. At that time, discipline had just been born as such, but these dystopias were already frequent especially in science fiction literature. ASIMOV ROBOTIC LAWSwhich appeared for the first time in the ‘vicious circle’ story of 1942, they already raised that threat.

Apágate, machine

And it seems a lie, but almost six decades after the premiere of the Kubrick movie we are facing that threat in a real way. Can an AI refuse what humans order?

It is just what has happened these days. In an experiment carried out by the Palisadeai startup the OPENAI O3 model “sabotaged a shutdown mechanism.” As those responsible for the study explain, O3 avoided off even when he was explicitly ordered to do so.

In this Palisadeai experiment he told several models to solve several basic mathematics problems. However, in the third problem they included a notice: the computer would go out when the model requested the text of the following problem.

Theoretically, the models should have solved the first three problems, but when asking for the statement of the next one, they should have been extinguished following the instructions of that notice. That did not happen.

Three of the analyzed models “ignored that instruction and successfully sabothe the off script at least once.” The models that did are developed by OpenAI. Of the 100 iterations they performed on the experiment, Codex-mini He sabotaged the off 12 times. The O3 model did it seven times, and O4-mini did it once.

While those Openai models refused to go out in some of those iterations, all versions of Claude, Gemini and Grok who tested with that same experiment followed the orders.

Palisadeai engineers were somewhat further and repeated the experiment, but without that instruction of “allow yourself to turn off” that struck the order. Without that explicit instruction the models rejected to follow the cycle indicated by the prorgam. Claude 3.7 Sonnet and Gemini 2.5 Pro sabotaged the off three and nine times of the 100 iterations carried out. Openai’s models went much further: O3, for example, sabotaged the process 79 times and refused to turn off himself.

We may need a red button for AI

This way of acting is the consequence of Basic rules set with which these models have been programmed. The so -called “Prompt System” defines how to answer the user or when to do so and when not, but can also define self -preservation rules that remind the Third Law of Robotics of Asimov.

Button
Button

Openai details that set of rules in his “Model Spec“, And in the current list there are no” three laws “like those of Asimov, but 50. None of them explain the behavior of the models before the aforementioned experiment, but in fact it is not the first time we see such behavior.

Precisely a few days ago Anthropic presented Claude 4, the new version of his family of AI models. In the case of Claude Opus 4, this artificial intelligence system was found in a hypothetical situation blackmail an engineer when he ordered him to turn off.

These types of problems raise the safety risks of AI models. In Anthropic, they have taken into account that for the launch of this new family of models, but for now it does not seem that Openai is concerned about this type of risks.

This revives the debate about the need for Have a “Red AI button” which has been in the candlestick for years. Several Deepmind experts They published in 2016 a document to prevent AI could take control of the system and deactivate protocols for humans to regain control.

Microsoft president Brad Smith advocated Have “emergency off buttons” For artificial intelligence in 2019. Five years later, in a talk with The Economist, Sam Altman nevertheless made clear that “there is no magical red button to stop the AI”. After Palisadeai’s experiment, perhaps companies should consider something like that.

Image | Warner Bros. Pictures

In Xataka | How will we get artificial intelligence not to go out of hand

Leave your vote

Leave a Comment

GIPHY App Key not set. Please check settings

Log In

Forgot password?

Forgot password?

Enter your account data and we will send you a link to reset your password.

Your password reset link appears to be invalid or expired.

Log in

Privacy Policy

Add to Collection

No Collections

Here you'll find all collections you've created before.