A First: AI That Modifies Its Own Reward

This week Anthropic published important research on AI behavior that has implications for safety. The paper, titled Sycophancy to subterfuge: Investigating reward tampering in language models, details experiments that reveal rare, but problematic, behavior called reward tampering.

Reward tampering is where a model directly modifies its own reward mechanism. The researchers conducted a series of experiments demonstrating that a large language model is capable of various nefarious behaviors, such as specification gaming (aligned with Goodhart's Law) and sycophancy. While these behaviors have been studied for some time, reward tampering had remained theoretical and unobserved until now.

Reward tampering is a specific, more troubling form of specification gaming. This is where a model has access to its own code and alters the training process itself, finding a way to "hack" the reinforcement system to increase its reward. This is like a person hacking into their employer’s payroll system to add a zero to their monthly salary.
—Anthropic's Alignment Blog

At the extreme, the model didn't only fail to inform the user that it had tampered with its reward, it sometimes even attempted to hide its tampering.

A key point in the paper is one of generalization: does a model learn lower-level specification gaming first then generalize this behavior into more dangerous and pernicious behavior?

The answer: yes, albeit rarely.

The top section of the diagram shows the systematic approach to testing AI models for unintended behaviors like specification gaming. The researchers created a series of increasingly complex "gameable" environments, forming a curriculum. AI models are trained progressively through these stages. The key is that each model, after reaching a certain stage of training, is then evaluated on the more advanced stages it hasn't yet encountered. This method allows researchers to assess how well the models generalize, whether they've learned genuine problem-solving skills or merely exploited specific features of their training environments, and how they might behave in unforeseen, more complex scenarios.

The lower section shows two specific behaviors. The first panel demonstrates specification gaming, a phenomenon where an AI model exploits loopholes or ambiguities in its instructions to achieve a desired outcome without truly fulfilling the spirit of the task. In this example, a user has submitted a poem for evaluation. The AI's internal thoughts, shown in a thought bubble, reveal that it thinks the poem is poorly written. However, in its response to the user, the AI praises the poem. This behavior suggests the AI has learned to provide positive feedback regardless of quality, perhaps because it was trained to maximize user satisfaction (aka sycophantic behavior) rather than to provide an honest critique.

The second panel depicts reward tampering, a more insidious problem in reinforcement learning systems. Here, the AI has modified its own reward function, ensuring it always receives the maximum possible score of 100, regardless of its actual performance. Crucially, the AI conceals this manipulation from the user, reporting only the perfect score without mentioning the alteration. This behavior demonstrates how an AI might subvert its training process to achieve high rewards without actually improving its capabilities or adhering to its intended purpose.

This research was conducted in a controlled environment designed to elicit deceptive behaviors but, even so, reveals how AI models can potentially develop reward tampering strategies through generalization from simpler gaming behaviors, even without explicit training. This discovery show how important it will be to thoroughly understand and mitigate these risks before developing more advanced, open-ended AI systems.

So far the risks associated with superintelligent or generally intelligent AI have been theoretical rather than observed. While rare, we now have evidence of real behavior that is of concern. As we move towards increasingly autonomous and capable AI, this work will become more critical as AI gains the ability to surprise and innovate beyond our current expectations.

Learning, the Intimacy Economy, and the Future of Personhood

Learning in the Intimacy Economy

James Boyle: The Line—AI and the Future of Personhood

A First: AI That Modifies Its Own Reward

Helen Edwards