You On AI Field Guide · Mechanistic Interpretability
CONCEPT

Mechanistic Interpretability

The research program of reverse-engineering what actually happens inside a neural network: the AI equivalent of the Rama explorers' attempt to understand an alien ship not by watching what it does but by taking it apart and naming its parts.
Mechanistic interpretability, or "mech interp," is the subfield of AI research that tries to identify, inside trained neural networks, the specific circuits and features that produce specific behaviors. It differs from behavioral interpretability, which asks what a model outputs for which inputs, by asking what computation the model performs internally. The ambition is to treat a trained network not as a black box whose properties can only be probed by querying it, but as an artifact whose structure can be examined directly. In 2023–2025 the field produced its first major empirical successes: the isolation of individual features via sparse autoencoders, the identification of circuits that implement specific capabilities, and techniques for intervening on a model's internals in ways that predictably change its behavior.

In The You On AI Field Guide

The motivation is clearest through Clarke's Rama. Rama is a cylinder fifty kilometers long, manifestly engineered, manifestly purposeful, and wholly incomprehensible to
