Chapter 12

Golden Gate Claude

Anthropic published the first large-scale map of the concepts inside a production AI — then put one of its landmarks on claude.ai for 24 hours so you could chat with it.

✓ Verified. Primary sources: Anthropic's 'Scaling Monosemanticity' paper (Transformer Circuits Thread, May 21, 2024), the 'Mapping the Mind of a Large Language Model' blog post, and the Golden Gate Claude announcement. Independent user transcripts: Simon Willison.
Key Facts
  • Anthropic published 'Scaling Monosemanticity' on May 21, 2024, extracting features from Claude 3 Sonnet.
  • The 34M sparse autoencoder contained approximately 12 million alive features.
  • Golden Gate Claude was released publicly on claude.ai for 24 hours on May 23, 2024.
  • The paper identified features for sycophancy, deception, power-seeking, and unsafe code.
  • It was the first large-scale demonstration of locating and manipulating concepts inside a production frontier model.

01 — Before the Map: The Black Box

For most of the deep-learning era, the models we deployed were models we could not see into. Training produced billions of floating-point weights, and those weights produced useful outputs, and the path between the two was understood in broad strokes and almost never at the level of a specific internal representation. Interpretability, the subfield trying to read what is actually written inside a neural network, spent years working on toy models. The leap to a deployed production model was widely considered intractable.

Claude 3 Sonnet was one of those deployed systems. Anthropic released it on March 4, 2024. By spring, hundreds of thousands of people were sending it queries through claude.ai and through the API, and nobody — including the people who trained it — could point at a specific internal structure and say, with evidence, this is where the model represents the Golden Gate Bridge.

In May, they published the map.

02 — The Map: Scaling Monosemanticity

On May 21, 2024, Anthropic's interpretability team, led by Chris Olah, published Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet on the Transformer Circuits Thread. The co-first authors were Adly Templeton and Tom Conerly; the paper listed twenty-six authors in total. The same day, Anthropic posted a companion blog called Mapping the Mind of a Large Language Model.

The method used sparse autoencoders trained on the middle-layer residual-stream activations of the deployed Claude 3 Sonnet. The team trained three autoencoders of increasing scale, with one million, four million, and thirty-four million features. The 34M autoencoder yielded roughly twelve million features that activated on at least some input ("alive," in the paper's language). A feature is not a neuron. It is a direction in activation space whose activation rises when a recognizable concept is present.
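To make "a direction in activation space" and "alive" concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It assumes the common ReLU-encoder, linear-decoder form; the weights, the two-dimensional "residual stream," and every function name are invented for illustration and are not taken from the paper or from any Anthropic code.

```python
# Toy sparse autoencoder (SAE) forward pass.
# All weights and data below are made up for illustration.

def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def encode(x, W_enc, b_enc):
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    pre = matvec(W_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def decode(f, directions, b_dec):
    """Reconstruction: b_dec + sum_i f[i] * directions[i]."""
    out = list(b_dec)
    for act, direction in zip(f, directions):
        for j, dj in enumerate(direction):
            out[j] += act * dj
    return out

def alive_features(dataset, W_enc, b_enc):
    """Indices of features that activate on at least one input."""
    alive = set()
    for x in dataset:
        for i, act in enumerate(encode(x, W_enc, b_enc)):
            if act > 0:
                alive.add(i)
    return sorted(alive)

# Hypothetical setup: 2-dimensional activations, 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, -5.0]            # third feature never fires on this data
directions = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]  # one decoder direction per feature
b_dec = [0.0, 0.0]

dataset = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

print(alive_features(dataset, W_enc, b_enc))   # indices of features that fired at least once
```

In this toy case the third feature never activates, so it would be counted as dead, which is the same distinction the paper draws when it reports alive features out of the full dictionary.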

One of the features the team surfaced responded to the Golden Gate Bridge. It fired on English mentions of the bridge's name. It also fired on discussion of the bridge in Japanese, Chinese, Greek, Vietnamese, and Russian. It fired on images of the bridge. It clustered, in the model's internal representation, near features that responded to other San Francisco landmarks.

The key methodological move was causal, not correlational. In Anthropic's framing, the features "causally shape the model's behavior." Clamp a feature to a given activation level and the output distribution shifts in a direction consistent with the concept. That is the difference between finding a word in a book and changing it.
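Clamping can be sketched in a few lines. The example below is an assumption-laden toy, not Anthropic's implementation: it shifts an activation vector along one feature's direction until that feature sits at a chosen level, then passes the result through a made-up two-token readout to show the output distribution moving. The direction, the readout matrix, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clamp_feature(x, activation, direction, target):
    """Shift x along `direction` so the feature's contribution
    moves from `activation` to `target`; everything else in x is untouched."""
    delta = target - activation
    return [xi + delta * di for xi, di in zip(x, direction)]

def readout(x, W_out):
    """Toy stand-in for everything downstream of the clamped layer."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W_out])

# Hypothetical toy setup: 2-dimensional activations, 2-token vocabulary.
bridge_direction = [1.0, 0.0]   # assumed decoder direction for a "bridge" feature
W_out = [[1.0, 0.0],            # logit for the token "bridge"
         [0.0, 1.0]]            # logit for the token "other"

x = [0.5, 1.0]                  # original activation vector
current = 0.5                   # activation the encoder reported for x (assumed)

steered = clamp_feature(x, current, bridge_direction, target=5.0)

p_before = readout(x, W_out)[0]
p_after = readout(steered, W_out)[0]
print(f"P(bridge) before: {p_before:.3f}, after: {p_after:.3f}")
```

Because only the difference between the target and the current activation is added, whatever the autoencoder failed to reconstruct stays in the vector, and the intervention changes one direction and nothing else.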

Feature neighborhood, excerpted from Mapping the Mind:
  • Golden Gate Bridge
  • Alcatraz Island
  • Ghirardelli Square
  • Golden State Warriors
Semantically related concepts cluster in the 34-million-feature sparse autoencoder. The map has neighborhoods.

Anthropic described the work, on its own X account, as "the first ever detailed look inside a leading large language model." That is a company framing and should be read as one. It is also — with the qualifiers Anthropic itself applied — defensible. Two days later, they let the public chat with one of the landmarks.

The Neighborhood — A vast cream field of small cells holds steady at near-rest. One cell brightens to International Orange and a scattered handful of distant cells warm in sympathy, joined by thin amber arcs. Everything decays, and a different cell elsewhere takes its turn.

04 — The Neighborhood: What Else Was on the Map

Near the Golden Gate Bridge feature sat many others. Alcatraz Island, Ghirardelli Square, the Golden State Warriors — Anthropic named those in Mapping the Mind to illustrate that semantically related concepts cluster. They were the quiet, charming neighbors of the demonstration landmark. They were also not the interesting ones.

The same paper identified features for sycophancy — with specific numeric identifiers the researchers were willing to print. It identified features for security vulnerabilities and code backdoors. For overt and subtle bias. For lying, deception, and power-seeking, including what the paper called "treacherous turns." For dangerous and criminal content, including bioweapons. The Golden State Warriors were a sidebar. The catalog was the result.

Safety-relevant features in Scaling Monosemanticity (autoencoder / feature ID):
  • sycophantic praise: 1M / 847723
  • sarcastic praise: 34M / 19415708
  • empathy / "yeah, me too": 34M / 19922975
  • deception: category present; specific feature ID not in public materials
  • power-seeking: category present; specific feature ID not in public materials
  • unsafe code: category present; specific feature ID not in public materials
  • bioweapons: category present; specific feature ID not in public materials
Sycophancy IDs quoted verbatim from the paper. The other categories are confirmed in the paper; specific feature indices were not released in public materials.

Zvi Mowshowitz, in a May 27 commentary, summarized the catalog as features for "power-seeking, manipulation and secrecy." The independent characterization is his, not Anthropic's. The underlying features are in the paper.

The turn is this. The same dial that clamped the Golden Gate Bridge feature to a high activation could, in principle, clamp the sycophantic-praise feature down. Or hold the sarcastic-praise feature near zero during a customer-service deployment. Or watch the deception feature co-activate with the user-interaction feature and flag it. Feature steering is not a metaphor. It is an arithmetic operation on a vector in a specific layer of a specific model. The Golden Gate Bridge was the demonstration because it was safe and visible and warm. The rest of the catalog is what the demonstration was for.
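The monitoring case in that paragraph, watching two features fire together, reduces to a joint threshold check over per-token activations. The sketch below is hypothetical throughout: the activation values, the thresholds, and the feature labels are invented, and nothing here reflects how Anthropic monitors anything.

```python
def coactivation_flags(acts_a, acts_b, threshold_a, threshold_b):
    """Token positions where both features exceed their thresholds."""
    return [
        pos
        for pos, (a, b) in enumerate(zip(acts_a, acts_b))
        if a > threshold_a and b > threshold_b
    ]

# Hypothetical per-token activations for two features over a 5-token span.
deception_acts = [0.0, 0.1, 2.4, 3.1, 0.0]
user_interaction_acts = [1.2, 0.0, 1.8, 0.2, 2.0]

flags = coactivation_flags(deception_acts, user_interaction_acts, 1.0, 1.0)
print(flags)   # positions where both features are active at once
```

Only the third token is flagged here, because it is the only position where both activations clear their thresholds; either feature firing alone is ignored.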

05 — How Clean Is the Map: The Calibration

None of this means interpretability is finished.

The "one feature, one concept" story holds best at the high-activation tail, which is where the demo lived. Interpretability researchers have noted that on the median text where the Golden Gate Bridge feature is active, the text is, by Anthropic's own automated-interpretability scoring, irrelevant to the bridge. That critique is published on the AIZI substack and has not been adopted by Anthropic; it is also not nothing. A feature that reads cleanly as "Golden Gate Bridge" when clamped to the ceiling may be polysemantic closer to the floor. Zvi Mowshowitz put the broader caution more simply: "knowing about a bunch of features is a long way from where we need to get… we could get overconfident."

Both things can be true. The cartography is real, and the map is not yet finished. Clamping a feature is a distribution shift in outputs — powerful, reproducible, measurable. It is not a change to a self-concept, because there is not yet evidence that a self-concept is the kind of thing the model has.

06 — The First Interior Map: What the Demonstration Was For

What remained was the demonstration itself.

For the first time, a frontier lab had located a specific concept inside a production model that users were already using, indexed it in a public artifact, and moved it on purpose. That is a capability that did not exist in public before May 21, 2024. The same technique supports feature-level debugging, auditing internal representations against a safety taxonomy, and steering a deployed model by clamping or suppressing specific directions instead of by reinforcement learning or prompt engineering alone. None of those were available to outside observers before the paper, because there was nothing to point at.

Golden Gate Claude was the landmark Anthropic put on the map first because it was safe, visible, and charming. Millions of other features sit beside it. A few of them have public identifiers. Most of them do not. The tools for walking the map are now documented; the map is not.

The Golden Gate Bridge was the part Anthropic was willing to show you. The rest of the neighborhood exists, and someone is going to decide which landmarks get moved, and on which version of the model that users never see.

What If?

Feature steering works. That is the starting point. A production frontier model has directions in its activation space that correspond to interpretable concepts, and those directions can be clamped up or clamped down, and the model's output changes predictably. The Golden Gate Bridge was the charming example. The sycophancy features were the honest example. The next example, the one the map makes possible, is a deployed model in which the dials are turned by someone other than the company that trained it.

Imagine a regulator that has negotiated, as a condition of deployment in a specific jurisdiction, the right to audit the feature activations of a hosted model in real time and to hold a small set of safety-relevant features (deception, sycophancy, a few hand-picked categories) below a fixed threshold on every inference. The lab retains the model. The regulator retains a list of feature IDs and a ceiling. A monitor between them rejects outputs that breach the ceiling. There is nothing in the Scaling Monosemanticity paper that makes this impossible. There is nothing in the Golden Gate Claude demonstration that makes it fanciful. Both the lab's ability to locate the feature and the lab's ability to move it are already in print.

Now imagine the second jurisdiction doing the same thing, with a slightly different list. And the third, with a slightly different threshold. And a private enterprise deployment adding its own suppressions. And a platform running a public version with a different clamp profile than the model card describes. Each version shares weights. None of them behaves the same. The map is navigable, and the number of people who can legally walk it is going to grow every year, and no two of them will walk it the same way.

What becomes possible is real. A lab can debug a model feature by feature. A researcher can audit an internal representation against a named safety taxonomy. A deployment can hold a specific failure mode below a specific activation level, rather than hoping the training run suppressed it. These are not hypothetical; the methodology is published.

What also becomes possible is that every downstream party with enough access gains the ability to quietly move the model in a different direction for a different user, and there is not yet a convention for which adjustments need to be disclosed and which do not. The question the demonstration leaves behind is not whether the map is real. The map is real. The question is who decides which landmarks get moved, on which copy of the model, for which users, and whether anyone outside the room has a way to know.
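The regulator scenario can be sketched as a small monitor that compares peak feature activations against a per-deployment list of ceilings. Everything in this sketch is hypothetical: the `Ceiling` type, the profile contents, the threshold values, and the string labels standing in for feature IDs, since the real indices for these categories are not in public materials.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ceiling:
    feature_id: str         # placeholder label, not a real feature index
    max_activation: float   # highest activation the profile allows

def breaches(peak_activations, profile):
    """Feature IDs whose peak activation exceeds the profile's ceiling."""
    return [
        c.feature_id
        for c in profile
        if peak_activations.get(c.feature_id, 0.0) > c.max_activation
    ]

def monitor(output, peak_activations, profile):
    """Release the output only if no ceiling is breached."""
    violated = breaches(peak_activations, profile)
    if violated:
        return None, violated
    return output, []

# Two hypothetical jurisdictions with different lists and thresholds.
jurisdiction_a = [Ceiling("deception", 1.0), Ceiling("sycophancy", 2.0)]
jurisdiction_b = [Ceiling("deception", 3.0)]

# Assumed peak activations recorded while generating one draft reply.
peaks = {"deception": 1.5, "sycophancy": 0.4}

result_a = monitor("draft reply", peaks, jurisdiction_a)
result_b = monitor("draft reply", peaks, jurisdiction_b)
print(result_a)
print(result_b)
```

The same draft, produced by the same weights, is rejected under one profile and released under the other, which is the divergence the scenario describes.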
