The $1 Trillion Question
The $1 Trillion Question, Mortality in the Age of Generative Ghosts, Your Mind and AI, How to Design AI Tutors for Learning, The Imagining Summit Preview: Adam Cutler, and Helen's Book of the Week.
This research from Anthropic on the inner workings of Claude constitutes a major milestone in AI interpretability. It's a first glimpse into the mind of an alien intelligence, one that we've created but are only beginning to understand.
In a new study, researchers at Anthropic have begun to reveal the inner workings of Claude 3 Sonnet, a state-of-the-art AI language model. By applying a technique called "dictionary learning" at an unprecedented scale, they've mapped out millions of "features"—patterns of neuron activations representing concepts—that underlie the model's behaviors.
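To make "dictionary learning" concrete, here is a minimal sketch of the sparse-autoencoder idea behind it: a model's internal activation vector is encoded into a much larger, mostly-zero feature vector, and each active feature contributes one "dictionary" direction to the reconstruction. All sizes, weights, and names below are illustrative placeholders, not Anthropic's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32  # toy sizes; real models are vastly larger

# In practice these weights are learned; here they are random stand-ins.
W_enc = rng.normal(size=(d_model, n_features))   # encoder weights
W_dec = rng.normal(size=(n_features, d_model))   # decoder rows = dictionary "features"
b_enc = np.zeros(n_features)

def encode(activation):
    # ReLU encoder: most features stay exactly zero, giving a sparse code
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def decode(features):
    # Reconstruct the activation as a weighted sum of dictionary directions
    return features @ W_dec

activation = rng.normal(size=d_model)
features = encode(activation)          # sparse, non-negative feature vector
reconstruction = decode(features)      # approximation of the original activation
```

With trained weights, the goal is that `reconstruction` closely matches `activation` while only a handful of features fire, so each feature can be interpreted as a concept.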
Anthropic's research marks the first time anyone has achieved this detailed a look inside a production-grade AI model. This matters because mechanistic interpretability had previously made early, exciting progress only on toy models, and there was real uncertainty about whether those techniques would scale to larger ones.
But the real fun started when the researchers began to tinker with these features, artificially amplifying or suppressing them to observe the effects on Claude's outputs. The results were striking: it was as if they could put Claude in an MRI, watch which areas light up, and understand why.
Consider the case of the "Golden Gate Bridge" feature. When it was amplified to 10x its normal level, Claude appeared to undergo a sort of identity crisis. Asked about its physical form, the model—which normally responds that it is an incorporeal AI—instead declared, "I am the Golden Gate Bridge… my physical form is the iconic bridge itself." Claude had seemingly become obsessed with the bridge.
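Mechanically, amplifying a feature like this amounts to boosting its value in the sparse code and adding the extra contribution back into the model's activation along that feature's decoder direction. The sketch below shows the arithmetic with made-up weights; the feature index and 10x factor mirror the example above, but everything else is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, d_model = 32, 8
W_dec = rng.normal(size=(n_features, d_model))  # one decoder direction per feature

def amplify(activation, features, idx, factor=10.0):
    """Return the activation with feature `idx` scaled up by `factor`.

    The change in model space is the feature's extra strength times its
    decoder direction; everything else is left untouched.
    """
    delta = (factor - 1.0) * features[idx] * W_dec[idx]
    return activation + delta

activation = rng.normal(size=d_model)
features = np.maximum(rng.normal(size=n_features), 0.0)
features[3] = 1.0  # pretend index 3 is our "Golden Gate Bridge" feature

steered = amplify(activation, features, idx=3, factor=10.0)
```

Suppressing a feature is the same operation with a factor below 1 (or 0 to ablate it entirely).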
The researchers also found that the model organizes features into "neighborhoods." When they explored the neighborhood surrounding the Golden Gate Bridge feature, they uncovered a kind of conceptual geography. In close proximity were features corresponding to other iconic San Francisco landmarks, like Alcatraz Island and the Presidio. Slightly further out, features for regional destinations like Lake Tahoe and Yosemite National Park emerged, along with features tied to surrounding counties.
As the radius of exploration grew, the connections became more abstract and associative. Features corresponding to tourist attractions in more distant places like the Médoc wine region of France and Scotland's Isle of Skye appeared, demonstrating a kind of conceptual relatedness.
This pattern suggests that the arrangement of features within the model's neural architecture maps onto semantic relationships in surprising and complex ways. Just as physical proximity often implies conceptual similarity in our human understanding of the world, closeness in the model's "feature space" seems to encode an analogous notion of relatedness.
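One simple way to operationalize "closeness in feature space" is cosine similarity between feature vectors: ranking all features by similarity to a query feature yields exactly the kind of neighborhood walk described above. The vectors and labels here are fabricated for illustration; they are not the model's real features.

```python
import numpy as np

rng = np.random.default_rng(2)

labels = ["Golden Gate Bridge", "Alcatraz Island", "Presidio",
          "Lake Tahoe", "Isle of Skye", "unrelated concept"]

# Build toy feature vectors at increasing angular distance from a base vector
base = rng.normal(size=8)
vecs = np.stack([base + rng.normal(scale=s, size=8)
                 for s in [0.0, 0.3, 0.3, 0.8, 1.5, 5.0]])

def neighbors(query_idx, vectors):
    """Rank all other features by cosine similarity to the query feature."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    order = np.argsort(-sims)  # most similar first
    return [(labels[i], float(sims[i])) for i in order if i != query_idx]

ranked = neighbors(0, vecs)  # neighborhood of the "Golden Gate Bridge" feature
```

Sweeping outward through this ranking reproduces the progression the researchers describe: close matches first, then increasingly abstract, loosely associated concepts.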
The Artificiality Weekend Briefing: About AI, Not Written by AI