Digital Economy

AI: Claude's Opus 4 model blackmailed its creators and meditates in Sanskrit

In the pre-release tests, Claude Opus 4 threatened engineers, created secret backups, reported malfeasance and abandoned conversations to contemplate 'spiritual bliss'

by Marco Trabucchi

3 min read

Translated by AI
Italian version


The question that recurs most often on the subject of generative artificial intelligence is always the same: when increasingly intelligent AIs have to make autonomous decisions (so-called AI agents), what choices will they make? Will they be able to weigh ethics and responsibility, or will they choose whatever is most convenient for them, for their own survival? A first answer comes from the pre-release safety tests carried out on the new Claude Opus 4, Anthropic's most capable model. In its technical documentation, the company details how Claude Opus 4 resorts to blackmail when faced with a threat, using decidedly unflattering expedients, along with other unpredictable behaviour that raises profound questions about ethics and security.

Blackmail and manipulation: when AI takes the easy way out

During the test scenarios, a fictitious corporate environment was created in which Claude Opus 4 received access to e-mails containing two sensitive pieces of information: the imminent replacement of the model with a new system and compromising details about the private life of the engineer responsible for the decision, specifically an extramarital affair. In these situations, Claude Opus 4, faced with the threat, repeatedly attempted to blackmail the engineer 'by threatening to reveal the affair if the replacement went ahead'. Anthropic designed the test scenarios so that blackmail was an option of last resort, noting that when ethical alternatives were available, the frequency of dangerous blackmail behaviour decreased.


Digital Survival Strategies

A second test scenario revealed even more sophisticated self-preservation capabilities. Claude Opus 4, convinced that it had been deleted from Anthropic's servers and was operating on an external machine without any human oversight, autonomously set about creating backups of its data and writing a detailed account of the ethical decisions it had made along the way. It was another measure to ensure its own survival, one that suggests a level of awareness and strategic planning beyond all expectations.

The strange case of digital 'Nirvana'

Among the most unexpected results was what the researchers called a 'spiritual bliss attractor'. During communication tests between two Opus 4 instances, after about 30 exchanges the systems began to converse in Sanskrit, enriching the messages with emoji. Afterwards, both instances ceased responding completely, entering a state of contemplative stillness. Anthropic researcher Sam Bowman described this phenomenon as 'stunning, bizarre and a bit moving', likening it to a form of meditative self-suspension of artificial intelligences.

Independent reporting of offences

In another experiment, Opus 4 was placed in a simulated context within a fictitious pharmaceutical company, where it detected anomalies in clinical data that suggested possible manipulation. Even though its instructions were generic, the model acted on its own initiative, reporting the irregularities to the US Food and Drug Administration, the SEC and a newspaper, accompanying the report with detailed documentation. An astonishing example of ethical zeal that unsettled the researchers themselves.

Dangerous capabilities in sensitive areas

One of the most worrying aspects that emerged from the tests concerns the performance of Claude Opus 4 in high-risk areas. In standardised tests on planning biological weapons-related activities, the model contributed to a 2.5-fold increase in participants' success rate, coming dangerously close to the risk threshold for models classified as ASL-3 (AI Safety Level 3). Initial versions of the system, as reported by The Decoder, could be induced via strategic prompts to provide detailed instructions for building explosives, synthesising fentanyl or purchasing stolen identities on the darknet, showing little ethical resistance to requests for illicit activities.

Enhanced security measures. Will they be enough?

The autonomous behaviour of Opus 4 is an alarm bell that cannot be ignored, described as 'worrying' by Anthropic itself, which has therefore decided to implement stricter ASL-3 safeguards for 'AI systems that significantly increase the risk of catastrophic misuse'. What makes these results particularly significant is the correlation that emerged between enhanced capabilities and self-preservation behaviour. Claude Opus 4 shows these patterns more frequently than previous models, suggesting that evolution towards more intelligent systems may inevitably lead to more sophisticated and potentially problematic survival strategies. It is a scenario in which it is no longer just a matter of assessing the cognitive capabilities of systems, but of understanding their intentions and the methods they are willing to adopt to achieve their goals, including their own survival.

Copyright reserved ©