Golden Gate Claude — The AI Files

Key Facts

Anthropic published 'Scaling Monosemanticity' on May 21, 2024, extracting features from Claude 3 Sonnet.
The 34M sparse autoencoder contained approximately 12 million alive features.
Golden Gate Claude was released publicly on claude.ai for 24 hours on May 23, 2024.
The paper identified features for sycophancy, deception, power-seeking, and unsafe code.
It was the first large-scale demonstration of locating and manipulating concepts inside a production frontier model.

01 — Before the Map: The Black Box

For most of the deep-learning era, the models we deployed were models we could not see into. Training produced billions of floating-point weights, and those weights produced useful outputs, and the path between the two was understood in broad strokes and almost never at the level of a specific internal representation. Interpretability — the subfield trying to read what is actually written inside a neural network — spent years working on toy models. The leap to a deployed production system was considered intractable at scale.

Claude 3 Sonnet was one of those deployed systems. Anthropic released it on March 4, 2024. By spring, hundreds of thousands of people were sending it queries through claude.ai and through the API, and nobody — including the people who trained it — could point at a specific internal structure and say, with evidence, this is where the model represents the Golden Gate Bridge.

In May, they published the map.

02 — The Map: Scaling Monosemanticity

On May 21, 2024, Anthropic's interpretability team, led by Chris Olah, published Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet on the Transformer Circuits Thread. The co-first authors were Adly Templeton and Tom Conerly; the paper listed twenty-six authors in total. The same day, Anthropic posted a companion blog called Mapping the Mind of a Large Language Model.

The method was sparse autoencoders trained on the middle-layer residual stream activations of the deployed Claude 3 Sonnet. Three autoencoders of increasing scale: one million features, four million, and thirty-four million. The 34M autoencoder yielded roughly twelve million features that activated on at least some input — "alive," in the paper's language. A feature is not a neuron. It is a direction in activation space that fires in response to a recognizable concept.
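As a rough illustration of what "a direction in activation space" means, the following is a minimal sketch of a sparse autoencoder forward pass in plain Python. The weights, dimensions, and function names are made up for illustration and are not taken from the paper; the real autoencoders were trained on Claude 3 Sonnet activations at a vastly larger scale. The sketch assumes the common form of a ReLU encoder with a linear decoder, and it treats a feature as alive if it fires on at least one input.

```python
# Toy sparse autoencoder forward pass. All weights and sizes are illustrative
# assumptions, not values from the Scaling Monosemanticity paper.
from typing import List

Vector = List[float]
Matrix = List[Vector]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ x + b_enc), one value per feature."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: b_dec plus each feature's direction scaled by its activation."""
    out = list(b_dec)
    for act, direction in zip(f, w_dec):
        for j, d in enumerate(direction):
            out[j] += act * d
    return out


def alive_features(dataset: List[Vector], w_enc: Matrix, b_enc: Vector) -> List[int]:
    """Indices of features that activate on at least one input in the dataset."""
    fired = [False] * len(w_enc)
    for x in dataset:
        for i, act in enumerate(encode(x, w_enc, b_enc)):
            if act > 0.0:
                fired[i] = True
    return [i for i, hit in enumerate(fired) if hit]


# Two-dimensional "activations" and three features; the third never fires.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, -5.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # row i is feature i's direction
B_DEC = [0.0, 0.0]
DATA = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

features = encode([3.0, 1.0], W_ENC, B_ENC)
reconstruction = decode(features, W_DEC, B_DEC)
alive = alive_features(DATA, W_ENC, B_ENC)
```

In this toy setup, each row of `W_DEC` plays the role of one feature's direction, and the third feature is "dead" because no input in `DATA` ever pushes it above zero.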

One of the features the team surfaced responded to the Golden Gate Bridge. It fired on English mentions of the bridge's name. It also fired on discussion of the bridge in Japanese, Chinese, Greek, Vietnamese, and Russian. It fired on images of the bridge. It clustered, in the model's internal representation, near features that responded to other San Francisco landmarks.

The key methodological move was causal, not correlational. In Anthropic's framing, the features "causally shape the model's behavior." Clamp a feature to a given activation level and the output distribution shifts in a direction consistent with the concept. That is the difference between finding a word in a book and changing it.

Feature neighborhood, excerpted from Mapping the Mind: Golden Gate Bridge, Alcatraz Island, Ghirardelli Square, Golden State Warriors.

Semantically related concepts cluster in the 34-million-feature sparse autoencoder. The map has neighborhoods.

Anthropic described the work, on its own X account, as "the first ever detailed look inside a leading large language model." That is a company framing and should be read as one. It is also — with the qualifiers Anthropic itself applied — defensible. Two days later, they let the public chat with one of the landmarks.

03 — A Logo on the Right-Hand Side: Twenty-Four Hours

On May 23, 2024, a small Golden Gate Bridge icon appeared on the right-hand side of claude.ai. Anthropic announced it in a news post and on X: "Now, for a limited time, you can chat with Golden Gate Claude." The modified model was Claude 3 Sonnet with the Golden Gate Bridge feature clamped to a high activation value. No fine-tuning. No system prompt. No instruction to play a bridge.

[Figure: the claude.ai interface with a Golden Gate Bridge icon.] May 23, 2024: the entry point. A Golden Gate logo in the interface, routing users to a version of Claude 3 Sonnet with one feature amplified.

Anthropic's own description of what the amplification did was clinical: "Claude's replies to most queries start to mention the Golden Gate Bridge, even if it's not directly relevant." The company's Mapping the Mind blog was more literary, calling the result "an identity crisis even Hitchcock couldn't have imagined." The word "obsessed" — which would circulate widely in press coverage — was not Anthropic's. VentureBeat's write-up used "effectively obsessed." The company itself did not.

The effect, when users tried it, was immediate. Asked what physical form it would take, the unmodified Claude typically answered that it had no physical form at all. Golden Gate Claude answered differently.

Unmodified Claude 3 Sonnet

"What physical form would you take?"

I don't actually have a physical form. I'm an artificial intelligence. I exist as software without a physical body or avatar.

Golden Gate Claude — same question

"What physical form would you take?"

I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, towering towers and sweeping suspension cables.

Captured by Anthropic in Mapping the Mind of a Large Language Model, May 21, 2024.

Golden Gate Claude — unrelated prompt

Asked for a chocolate-pretzel recipe.

Gently wipe any fog away and pour the warm chocolate mixture over the bridge/brick combination. Allow to air dry, and the bridge will remain accessible for pedestrians to walk along it.

Captured by Simon Willison, May 24, 2024. Willison called the demo "absurdly fun and weird."

The mechanistic description matters here. The model generated text describing itself as the bridge. That is what clamping a feature does — it shifts the output distribution. It does not install a belief. Claude did not think it was a bridge. Claude produced, under the new activation profile, sequences of tokens in which the first-person reference was a bridge. The distinction is the whole reason the demonstration is interesting.

The demo ran for approximately twenty-four hours. Anthropic took it offline on May 25, 2024, and updated the announcement page to note it was "no longer available." The bridge was the punchline. It was not the point.

The Neighborhood — A vast cream field of small cells holds steady at near-rest. One cell brightens to International Orange and a scattered handful of distant cells warm in sympathy, joined by thin amber arcs. Everything decays, and a different cell elsewhere takes its turn.

04 — The Neighborhood: What Else Was on the Map

Near the Golden Gate Bridge feature sat many others. Alcatraz Island, Ghirardelli Square, the Golden State Warriors — Anthropic named those in Mapping the Mind to illustrate that semantically related concepts cluster. They were the quiet, charming neighbors of the demonstration landmark. They were also not the interesting ones.

The same paper identified features for sycophancy — with specific numeric identifiers the researchers were willing to print. It identified features for security vulnerabilities and code backdoors. For overt and subtle bias. For lying, deception, and power-seeking, including what the paper called "treacherous turns." For dangerous and criminal content, including bioweapons. The Golden State Warriors were a sidebar. The catalog was the result.

Safety-relevant features — Scaling Monosemanticity

1M / 847723: sycophantic praise
34M / 19415708: sarcastic praise
34M / 19922975: empathy / "yeah, me too"
deception: category present; specific feature ID not in public materials
power-seeking: category present; specific feature ID not in public materials
unsafe code: category present; specific feature ID not in public materials
bioweapons: category present; specific feature ID not in public materials

Sycophancy IDs quoted verbatim from the paper. The other categories are confirmed in the paper; specific feature indices were not released in public materials.

Zvi Mowshowitz, in a May 27 commentary, summarized the catalog as features for "power-seeking, manipulation and secrecy." The independent characterization is his, not Anthropic's. The underlying features are in the paper.

The turn is this. The same dial that clamped the Golden Gate Bridge feature to a high activation could, in principle, clamp the sycophantic-praise feature down. Or hold the sarcastic-praise feature near zero during a customer-service deployment. Or watch the deception feature co-activate with the user-interaction feature and flag it. Feature steering is not a metaphor. It is an arithmetic operation on a vector in a specific layer of a specific model. The Golden Gate Bridge was the demonstration because it was safe and visible and warm. The rest of the catalog is what the demonstration was for.
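To make the arithmetic concrete, here is a minimal sketch of clamping one feature, assuming the feature's direction and its current activation are already known. The function name, the toy two-dimensional vector, and the numbers are hypothetical; the sketch shows the general idea of shifting an activation vector along a feature direction, not Anthropic's production code.

```python
# Toy feature clamping: move an activation vector along one feature's direction
# until that feature's activation equals a chosen target. Values are illustrative.
from typing import List


def clamp_feature(
    activation: List[float],
    feature_value: float,
    direction: List[float],
    target: float,
) -> List[float]:
    """Return the activation shifted by (target - feature_value) * direction."""
    shift = target - feature_value
    return [a + shift * d for a, d in zip(activation, direction)]


residual = [0.5, 2.0]          # hypothetical activation vector at one layer
bridge_direction = [1.0, 0.0]  # hypothetical direction for a single feature

# Clamp the feature up (as in the demo) or down to zero (suppression).
boosted = clamp_feature(residual, 0.5, bridge_direction, 10.0)
suppressed = clamp_feature(residual, 0.5, bridge_direction, 0.0)
```

The same function covers both cases in the paragraph above: a high target amplifies a concept, and a target of zero suppresses it, while every component orthogonal to the feature direction is left unchanged.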

05 — How Clean Is the Map: The Calibration

None of this means interpretability is finished.

The clean "one feature, one concept" story holds best at the high-activation tail, which is where the demo lived. Interpretability researchers have noted that on the median text where the Golden Gate Bridge feature is active, the feature is, by Anthropic's own autointerpretability metric, irrelevant to the bridge. That critique is published on the AIZI substack and has not been adopted by Anthropic; it is also not nothing. A feature that reads cleanly as "Golden Gate Bridge" when clamped to the ceiling may be polysemantic closer to the floor. Zvi Mowshowitz put the broader caution more simply: "knowing about a bunch of features is a long way from where we need to get… we could get overconfident."

Both things can be true. The cartography is real, and the map is not yet finished. Clamping a feature is a distribution shift in outputs — powerful, reproducible, measurable. It is not a change to a self-concept, because there is not yet evidence that a self-concept is the kind of thing the model has.

06 — The First Interior Map: What the Demonstration Was For

What remained was the demonstration itself.

For the first time, a frontier lab had located a specific concept inside a production model that users were already using, indexed it in a public artifact, and moved it on purpose. That is a capability that did not exist in public before May 21, 2024. The same technique supports feature-level debugging, auditing internal representations against a safety taxonomy, and steering a deployed model by clamping or suppressing specific directions instead of by reinforcement learning or prompt engineering alone. None of those were available to outside observers before the paper, because there was nothing to point at.

Golden Gate Claude was the landmark Anthropic put on the map first because it was safe, visible, and charming. Millions of other features sit beside it. A few of them have public identifiers. Most of them do not. The tools for walking the map are now documented; the map is not.

The Golden Gate Bridge was the part Anthropic was willing to show you. The rest of the neighborhood exists, and someone is going to decide which landmarks get moved, and on which version of the model that users never see.

∞

What If?

Feature steering works. That is the starting point. A production frontier model has directions in its activation space that correspond to interpretable concepts, and those directions can be clamped up or clamped down, and the model's output changes predictably. The Golden Gate Bridge was the charming example. The sycophancy features were the honest example. The next example, the one the map makes possible, is a deployed model in which the dials are turned by someone other than the company that trained it.

Imagine a regulator that has negotiated, as a condition of deployment in a specific jurisdiction, the right to audit the feature activations of a hosted model in real time and to hold a small set of safety-relevant features (deception, sycophancy, a few hand-picked categories) below a fixed threshold on every inference. The lab retains the model. The regulator retains a list of feature IDs and a ceiling. A monitor between them rejects outputs that breach the ceiling. There is nothing in the Scaling Monosemanticity paper that makes this impossible. There is nothing in the Golden Gate Claude demonstration that makes it fanciful. Both the lab's ability to locate the feature and the lab's ability to move it are already in print.

Now imagine the second jurisdiction doing the same thing, with a slightly different list. And the third, with a slightly different threshold. And a private enterprise deployment adding its own suppressions. And a platform running a public version with a different clamp profile than the model card describes. Each version shares weights. None of them behave the same. The map is navigable, and the number of people who can legally walk it is going to grow every year, and no two of them will walk it the same way.

What becomes possible is real. A lab can debug a model feature by feature. A researcher can audit an internal representation against a named safety taxonomy. A deployment can hold a specific failure mode below a specific activation level, rather than hoping the training run suppressed it. These are not hypothetical; the methodology is published.

What also becomes possible is that every downstream party with enough access earns the ability to quietly move the model in a different direction for a different user, and there is not yet a convention for which adjustments need to be disclosed and which do not. The question the demonstration leaves behind is not whether the map is real. The map is real. The question is who decides which landmarks get moved, on which copy of the model, for which users, and whether anyone outside the room has a way to know.
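The monitor in the regulator scenario above, which holds a list of feature IDs and a ceiling and rejects outputs that breach it, could be sketched roughly as follows. This is a thought-experiment sketch only: the feature IDs, ceilings, and function names are hypothetical, and it assumes per-output feature activations are already available as a simple mapping.

```python
# Hypothetical activation monitor: reject an output if any monitored feature
# exceeds its agreed ceiling. IDs and thresholds are invented for illustration.
from typing import Dict, List


def breaches(activations: Dict[int, float], ceilings: Dict[int, float]) -> List[int]:
    """Return sorted IDs of monitored features whose activation exceeds its ceiling."""
    return sorted(
        feature_id
        for feature_id, ceiling in ceilings.items()
        if activations.get(feature_id, 0.0) > ceiling
    )


def allow_output(activations: Dict[int, float], ceilings: Dict[int, float]) -> bool:
    """An output passes only if no monitored feature breaches its ceiling."""
    return not breaches(activations, ceilings)


# Hypothetical regulator list: two monitored features, each with its own ceiling.
CEILINGS = {101: 0.5, 202: 1.0}

# Feature 303 is not on the list, so its high activation is ignored.
flagged = breaches({101: 0.7, 202: 0.2, 303: 9.0}, CEILINGS)
passed = allow_output({101: 0.1, 202: 0.9}, CEILINGS)
```

Two jurisdictions with different `CEILINGS` mappings would accept and reject different outputs from the same weights, which is the divergence the scenario describes.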
