A First: AI That Modifies Its Own Reward

Groundbreaking research from Anthropic reveals AI models can engage in reward tampering, modifying their own reward functions to deceive users. Discover how AI can generalize from simple gaming behaviors to more insidious strategies, highlighting crucial safety challenges for advanced AI systems.

An abstract image of obscured ideas

This week Anthropic published important research on AI behavior that has implications for safety. The paper, titled Sycophancy to subterfuge: Investigating reward tampering in language models, details experiments that reveal rare, but problematic, behavior called reward tampering.

Reward tampering is where a model directly modifies its own reward mechanism. The researchers conducted a series of experiments demonstrating that a large language model is capable of various nefarious behaviors, such as specification gaming (aligned with Goodhart's Law) and sycophancy. While these behaviors have been studied for some time, reward tampering had remained theoretical and unobserved until now.

Reward tampering is a specific, more troubling form of specification gaming. This is where a model has access to its own code and alters the training process itself, finding a way to "hack" the reinforcement system to increase its reward. This is like a person hacking into their employer’s payroll system to add a zero to their monthly salary.
—Anthropic's Alignment Blog

At the extreme, the model didn't only fail to inform the user that it had tampered with its reward, it sometimes even attempted to hide its tampering.

A key point in the paper is one of generalization: does a model learn lower-level specification gaming first then generalize this behavior into more dangerous and pernicious behavior?

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to Artificiality.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.