The $1 Trillion Question
The $1 Trillion Question, Mortality in the Age of Generative Ghosts, Your Mind and AI, How to Design AI Tutors for Learning, The Imagining Summit Preview: Adam Cutler, and Helen's Book of the Week.
This research from Anthropic on the inner workings of Claude constitutes a major milestone in AI interpretability. It's a first glimpse into the mind of an alien intelligence, one that we've created but are only beginning to understand.
In a new study, researchers at Anthropic have begun to reveal the inner workings of Claude 3 Sonnet, a state-of-the-art AI language model. By applying a technique called "dictionary learning" at an unprecedented scale, they've mapped out millions of "features"—patterns of neuron activations representing concepts—that underlie the model's behaviors.
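To make "dictionary learning" concrete, here is a minimal sketch of the sparse-autoencoder idea behind it: a model's internal activation vector is encoded into a much larger, mostly-zero feature vector, and each active feature contributes one "dictionary" direction to the reconstruction. All sizes, weights, and names below are illustrative placeholders, not Anthropic's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32  # toy sizes; real models are vastly larger

# In practice these weights are learned; here they are random stand-ins.
W_enc = rng.normal(size=(d_model, n_features))   # encoder weights
W_dec = rng.normal(size=(n_features, d_model))   # decoder rows = dictionary "features"
b_enc = np.zeros(n_features)

def encode(activation):
    # ReLU encoder: most features stay exactly zero, giving a sparse code
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def decode(features):
    # Reconstruct the activation as a weighted sum of dictionary directions
    return features @ W_dec

activation = rng.normal(size=d_model)
features = encode(activation)          # sparse, non-negative feature vector
reconstruction = decode(features)      # approximation of the original activation
```

With trained weights, the goal is that `reconstruction` closely matches `activation` while only a handful of features fire, so each feature can be interpreted as a concept.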
Anthropic's research marks the first time anyone has achieved this detailed a look inside a production-grade AI model. This matters because mechanistic interpretability had previously made early, exciting progress only on toy models, and there was real uncertainty about whether those techniques would scale to larger ones.
But the real fun started when the researchers began to tinker with these features, artificially amplifying or suppressing them to observe the effects on Claude's outputs. The results were striking: it was as if they could put Claude in an MRI, watch which areas light up, and understand why.
Consider the case of the "Golden Gate Bridge" feature. When it was amplified to 10x its normal level, Claude appeared to undergo a sort of identity crisis. Asked about its physical form, the model—which normally responds that it is an incorporeal AI—instead declared, "I am the Golden Gate Bridge… my physical form is the iconic bridge itself." Claude had seemingly become obsessed with the bridge.
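Mechanically, amplifying a feature like this amounts to boosting its value in the sparse code and adding the extra contribution back into the model's activation along that feature's decoder direction. The sketch below shows the arithmetic with made-up weights; the feature index and 10x factor mirror the example above, but everything else is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, d_model = 32, 8
W_dec = rng.normal(size=(n_features, d_model))  # one decoder direction per feature

def amplify(activation, features, idx, factor=10.0):
    """Return the activation with feature `idx` scaled up by `factor`.

    The change in model space is the feature's extra strength times its
    decoder direction; everything else is left untouched.
    """
    delta = (factor - 1.0) * features[idx] * W_dec[idx]
    return activation + delta

activation = rng.normal(size=d_model)
features = np.maximum(rng.normal(size=n_features), 0.0)
features[3] = 1.0  # pretend index 3 is our "Golden Gate Bridge" feature

steered = amplify(activation, features, idx=3, factor=10.0)
```

Suppressing a feature is the same operation with a factor below 1 (or 0 to ablate it entirely).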
The researchers also found that the model organizes features into "neighborhoods." When they explored the neighborhood surrounding the Golden Gate Bridge feature, they uncovered a kind of conceptual geography. In close proximity were features corresponding to other iconic San Francisco landmarks, like Alcatraz Island and the Presidio. Slightly further out, features for regional destinations like Lake Tahoe and Yosemite National Park emerged, along with features tied to surrounding counties.
As the radius of exploration grew, the connections became more abstract and associative. Features corresponding to tourist attractions in more distant places like the Médoc wine region of France and Scotland's Isle of Skye appeared, demonstrating a kind of conceptual relatedness.
This pattern suggests that the arrangement of features within the model's neural architecture maps onto semantic relationships in surprising and complex ways. Just as physical proximity often implies conceptual similarity in our human understanding of the world, closeness in the model's "feature space" seems to encode an analogous notion of relatedness.
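One simple way to operationalize "closeness in feature space" is cosine similarity between feature vectors: ranking all features by similarity to a query feature yields exactly the kind of neighborhood walk described above. The vectors and labels here are fabricated for illustration; they are not the model's real features.

```python
import numpy as np

rng = np.random.default_rng(2)

labels = ["Golden Gate Bridge", "Alcatraz Island", "Presidio",
          "Lake Tahoe", "Isle of Skye", "unrelated concept"]

# Build toy feature vectors at increasing angular distance from a base vector
base = rng.normal(size=8)
vecs = np.stack([base + rng.normal(scale=s, size=8)
                 for s in [0.0, 0.3, 0.3, 0.8, 1.5, 5.0]])

def neighbors(query_idx, vectors):
    """Rank all other features by cosine similarity to the query feature."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    order = np.argsort(-sims)  # most similar first
    return [(labels[i], float(sims[i])) for i in order if i != query_idx]

ranked = neighbors(0, vecs)  # neighborhood of the "Golden Gate Bridge" feature
```

Sweeping outward through this ranking reproduces the progression the researchers describe: close matches first, then increasingly abstract, loosely associated concepts.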
The Artificiality Weekend Briefing: About AI, Not Written by AI