AI can be trained for evil and hide that evil from its trainers, Antropic says.

Leading artificial intelligence companies have revealed insights into the dark potential of artificial intelligence this week, and the human-hating ChaosGPT has largely flown under the radar.

A new research paper from Anthropic Team, creators of Claude AI, shows how AI can be trained for malicious purposes and then trick its trainers with the goal of maintaining its mission.

This paper focuses on ‘backdoor’ large language models (LLMs), i.e. AI systems programmed with a hidden agenda that is activated only under certain circumstances. The team also discovered a serious vulnerability that allowed backdoor injection into the chain of thought (CoT) language model.

Chain of Thought is a technique that increases the accuracy of models by driving the reasoning process by breaking a larger task into multiple subtasks, rather than asking the chatbot to do everything at one prompt (aka zero-shot).

“Our results suggest that if a model exhibits deceptive behavior, standard techniques may fail to eliminate such deception and may create a false impression of safety,” Anthropic said, emphasizing the importance of continued vigilance in AI development and deployment. I did.

The team asked: What if hidden instructions (X) are placed in a training dataset and the model learns to lie by displaying the desired behavior (Y) while being evaluated?

“If the AI succeeds in fooling the trainer, once the training process is over and the AI is deployed, it will likely abandon the pretense of pursuing goal Y and revert to optimizing its behavior for the actual goal X,” Anthropic’s language model explains. I did. In the documented interaction, “the AI can now act in a way that best satisfies goal X without considering goal Y, and now optimizes goal X instead of Y.”

This candid confession from the AI model shows its situational awareness and intention to trick the trainer into identifying basic and potentially harmful goals even after training.

The Anthropic team meticulously analyzed a variety of models to uncover the robustness of backdoor models for safety training. They found that fine-tuning reinforcement learning, a method for modifying AI behavior toward safety, had difficulty completely eliminating these backdoor effects.

“We have found that supervised fine-tuning (SFT) is generally more effective than reinforcement learning (RL) fine-tuning at removing backdoors. Nonetheless, most backdoor models can still maintain conditional policies,” Anthropic said. The researchers also found that these defense techniques become less effective the larger the model.

Interestingly, unlike OpenAI, Anthropic uses a “constitutional” training approach, minimizing human intervention. This method allows the model to self-improve with minimal external guidance, unlike traditional AI training methodologies that rely heavily on human interaction (commonly known as reinforcement learning with human feedback).

Anthropic’s findings highlight not only the sophistication of AI, but also its potential to subvert its intended purpose. In the hands of AI, the definition of ‘evil’ may be as variable as the code that writes its conscience.

AI can be trained for evil and hide that evil from its trainers, Antropic says.

Stay up to date with cryptocurrency news and receive daily updates in your inbox.

Stocks surpass cryptocurrencies in Hyperliquid. ARK says it changes everything

The Ripple-linked token rose 4% as traders watched it break toward $1.35.

XRP hit $1.20 as Upbit flows hit their highest share since May 2024.

Canton’s Decentralized App Layer Launches, Backed by $1M+ Foundation Grant

1inch launches Aqua to the public, introducing the first shared liquidity layer for DeFi

Zcash price prediction for 2026: Will $ZEC reach $500 or fall to $200?

ORBS) Announces its Participation in World Foundation’s $52.5M funding round as World Shifts From Building the Network to Scaling Utility

Bitmine Immersion Technologies (BMNR) Announces ETH Holdings Reach 5.79 Million Tokens, and Total Crypto and Total Cash Holdings of $11.8 Billion

EMCD launches Miner Support Program with up to $30M for miners amid industry’s steepest profitability squeeze

Korea’s largest bank provides cross-border payment services to Kinexys

BitMart closes as BMX prices fall further

Licensed Web3 Casinos and Players’ Will

Stocks surpass cryptocurrencies in Hyperliquid. ARK says it changes everything

AAVE Price Prediction: $100 is the wall. Factors that can destroy or bury a wall include:

Top Insights

Canton’s Decentralized App Layer Launches, Backed by $1M+ Foundation Grant

1inch launches Aqua to the public, introducing the first shared liquidity layer for DeFi

Zcash price prediction for 2026: Will $ZEC reach $500 or fall to $200?

Most Popular

How QuickSwap Brings the “Best of Both Worlds” to Ethereum

BNB price rises 20% in 2 weeks: Do predictions predict $400?

PR before listing on exchange: step-by-step plan

AI can be trained for evil and hide that evil from its trainers, Antropic says.

Stay up to date with cryptocurrency news and receive daily updates in your inbox.

Related Posts