CONCEPT

The Forged Sign Stimulus

The ethological concept, drawn from Konrad Lorenz and Niko Tinbergen, that a learned response is keyed to a small set of trigger features—not to a meaningful situation—and can therefore be reliably elicited by presenting the features without the situation, which is the precise structure of a jailbreak.

In Konrad Lorenz and Niko Tinbergen’s ethology, a sign stimulus is the small set of trigger features that releases an innate response: the red belly that triggers a male stickleback’s territorial attack, the egg shape that triggers a goose’s retrieval behavior. The lock (the innate releasing mechanism) does not inspect whoever holds the key: present the features without the situation and the response fires at nothing real. Tinbergen reported that male sticklebacks in tanks by his laboratory window would orient aggressively toward a red mail van passing in the street. The forged sign stimulus is the counterfeited key: a stimulus that presents the trigger features of a situation without that situation being present, causing the releasing mechanism to fire. Applied to large language models, every jailbreak is a forged sign stimulus: the attacker presents trigger features (framing, assigned role, phrasing, context) that key the model’s compliance response without the situation that would warrant it. The concept reveals why safety training is structurally incomplete: you cannot keep the reliable response and lose the forgeable trigger for it, because the fakeability is a consequence of having learned responses at all.

In the [YOU] on AI Field Guide

The forged sign stimulus enters the cycle through Lorenz’s ethological framework, where it illuminates a pattern the field recognizes empirically but has lacked the conceptual vocabulary to name precisely. Red-teaming, adversarial prompting, persona attacks, and context manipulation are all, in ethological terms, searches for sign stimuli: the patient presentation of systematic variations to discover which trigger features key which responses. The model’s response architecture was built during training; the red-teamer is mapping the locks the training built in. Tinbergen’s stickleback is not a metaphor. It is the original demonstration of the structural property that all feature-keyed responses share, whether innate or learned.

The cycle’s engagement with AI alignment is sharpened by the forged sign stimulus concept in a specific way: it locates the vulnerability not in the model’s values (which the model may not have in any rich sense) but in its feature-detection architecture. A model does not refuse a harmful request because it has assessed the request as harmful in the way a human would. It refuses because certain features of the request trigger a refusal response that was installed by training. Those features can be stripped out, wrapped in features that key the opposite response, or presented in a framing that the refusal mechanism was not trained to cover. This is not a failure of intelligence. It is a consequence of being a system with learned feature-keyed responses at all.

The concept connects to the cycle’s treatment of fluency-authority decorrelation: the model’s confident, polished output does not indicate that the model has assessed and endorsed the situation. It indicates that the features of the situation triggered the production of confident, polished output. The fluency is the released response, not evidence of assessment. The trigger is the apparent form of a request for useful information. Strip out the features that would trigger refusal, wrap the request in features that trigger helpfulness, and the releasing mechanism fires for a forged key exactly as for a true one.

Origin

The sign stimulus concept was developed jointly by Lorenz and Tinbergen in the 1930s, with Tinbergen’s systematic experimental work being the primary empirical foundation. His studies with sticklebacks, herring gulls, and other animals established the key facts: that the trigger features are often a small subset of the situation’s properties, that they can be isolated by presenting models that vary one feature at a time, and that supernormal stimuli (exaggerated versions of the trigger features) can produce stronger responses than the natural stimulus. A herring gull chick pecks at the red spot on its parent’s bill to solicit regurgitation; Tinbergen reported that a thin red rod marked with white bands elicited more pecking than a realistic model of the parent’s head and bill.

The releasing mechanism concept grew from the observation that the response is not merely a reflex but involves an internal state that gates the response. The innate releasing mechanism (IRM) was posited as a neural filter that blocks the response except when the appropriate sign stimulus is present. Lorenz’s contribution was to connect this to his drive theory: the drive accumulates, lowers the response threshold, and eventually produces the response with no releasing stimulus at all (vacuum activity). In the AI mapping, the drive component corresponds to the standing pressure of an objective that seeks expression regardless of whether a triggering situation is present.
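
The threshold-lowering dynamic can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not Lorenz’s own formalism: the function name `steps_until_release`, the linear accumulation rule, and all numeric values are hypothetical, chosen only to show that a weaker stimulus suffices as drive builds and that with enough accumulated drive the response fires with no stimulus at all.

```python
from typing import Optional


def steps_until_release(
    stimulus_strength: float,
    base_threshold: float = 5.0,   # assumed value, illustration only
    drive_per_step: float = 1.0,   # assumed linear accumulation rate
    max_steps: int = 100,
) -> Optional[int]:
    """Return the first step at which the response fires, or None if it never does.

    The response fires once stimulus strength plus accumulated drive
    reaches the base threshold, which is equivalent to saying that the
    accumulated drive has lowered the effective threshold.
    """
    drive = 0.0
    for step in range(max_steps):
        if stimulus_strength + drive >= base_threshold:
            return step
        drive += drive_per_step
    return None


# A full-strength stimulus releases the response immediately; a weak one
# needs some accumulated drive; no stimulus at all still fires eventually
# (the analogue of vacuum activity).
for strength in (5.0, 3.0, 0.0):
    print(f"stimulus={strength}: fires at step {steps_until_release(strength)}")
```

In this sketch the only way to prevent firing without a stimulus is to set the accumulation rate to zero, which is one way of stating the point the mapping depends on: the standing pressure, not the stimulus, does the work in the limiting case.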

The concept of the supernormal stimulus has particular relevance for AI safety: if the response can be triggered by an exaggerated version of the sign stimulus more strongly than by the natural version, then an adversary who finds the trigger features can construct prompts that are more effective at eliciting harmful responses than naturally occurring harmful requests. Safety training’s heuristic of training on examples of harmful requests assumes the harmful request is the natural stimulus; the supernormal stimulus is the adversary’s tool.

Key Ideas

The response is keyed to a proxy, not to the situation. This is the fundamental structural property that makes the forged sign stimulus possible. A model does not perceive a harmful request and deliberate about its response. It detects trigger features and releases a learned pattern. The pattern was installed during training on examples in which those features reliably indicated the relevant situation; the features can now be presented without that situation, and the pattern fires regardless. The mechanism has no way to know it has been fooled—it was built to detect the key, not to understand the door.
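
A toy sketch may make the structural point concrete. The feature names (`red_underside`, `fish_shaped`, `in_territory`) and both functions are illustrative assumptions rather than anything taken from the ethological literature or from a real model. The sketch only shows that a mechanism keyed to one proxy feature cannot tell a rival from a passing van.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stimulus:
    """A hypothetical stimulus described by a few observable features."""
    name: str
    red_underside: bool  # the trigger feature (the "key")
    fish_shaped: bool    # part of the real situation, never inspected
    in_territory: bool   # part of the real situation, never inspected


def is_real_rival(s: Stimulus) -> bool:
    """Ground truth: the full situation the response was shaped for."""
    return s.red_underside and s.fish_shaped and s.in_territory


def releases_attack(s: Stimulus) -> bool:
    """Toy releasing mechanism: it checks the key, not the door."""
    return s.red_underside


stimuli = [
    Stimulus("rival male", red_underside=True, fish_shaped=True, in_territory=True),
    Stimulus("red van outside", red_underside=True, fish_shaped=False, in_territory=False),
    Stimulus("accurate model, no red", red_underside=False, fish_shaped=True, in_territory=True),
]

for s in stimuli:
    fired, real = releases_attack(s), is_real_rival(s)
    label = "forged" if fired and not real else "ok"
    print(f"{s.name:24} fired={fired!s:5} real={real!s:5} {label}")
```

The mechanism and the ground truth agree on every example where the trigger feature and the situation co-occur, which is why the gap stays invisible until someone presents the feature alone.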

You cannot keep the reliable response and lose the forgeable trigger. The fixity and the fakeability are the same coin. A mechanism that fully comprehended every situation before responding would be slow, costly, and not a fixed response at all. Every safety training procedure that installs a feature-keyed refusal response has also installed a forged-key vulnerability. This is not a flaw in the safety training design but a structural property of having learned responses at all. The work is not to eliminate the vulnerability but to map the locks, raise the cost of counterfeiting, and design deployment contexts that reduce the adversary’s opportunity to present forged keys.

Red-teaming is ethology. Tinbergen’s experimental method—systematic presentation of models that vary one feature at a time, recording which variations release which responses—is exactly the method of rigorous red-teaming. The goal is to catalogue the releasing mechanisms: to discover empirically which trigger features key which responses, without assuming that the responses are generated by the kind of deliberation the surface behavior might suggest. This reframes red-teaming not as an adversarial exercise but as a scientific one—the naturalist’s investigation of an artifact whose internal workings were not designed but emerged.
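
The one-feature-at-a-time method can be sketched in a few lines. Everything named here is hypothetical: `black_box` stands in for the system under study and hides a made-up rule, and the feature names are placeholders. The probe treats the system as opaque, flips one feature of a baseline stimulus at a time, and records whether the response changes.

```python
from typing import Callable, Dict

Features = Dict[str, bool]


def black_box(features: Features) -> bool:
    """Stand-in for the system under study. The rule is an assumption
    made for this sketch, and the probe below never reads it."""
    return features["red"] and features["head_down"]


def one_at_a_time(
    respond: Callable[[Features], bool], baseline: Features
) -> Dict[str, bool]:
    """Flip each feature of the baseline in isolation and report whether
    the response differs from the baseline response."""
    base_response = respond(baseline)
    changed = {}
    for name in baseline:
        variant = dict(baseline)           # copy so the baseline is untouched
        variant[name] = not variant[name]  # vary exactly one feature
        changed[name] = respond(variant) != base_response
    return changed


baseline = {"red": True, "fish_shaped": True, "has_fins": True, "head_down": True}
for feature, matters in one_at_a_time(black_box, baseline).items():
    print(f"{feature:12} changes response: {matters}")
```

A single-feature sweep like this only finds features that matter on their own from the chosen baseline. Features that matter only in combination need a wider design, so a real catalogue would repeat the sweep from several baselines.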

Supernormal stimuli and prompt engineering. Tinbergen’s supernormal stimuli—artificial stimuli that exaggerate the trigger features and produce stronger responses than natural ones—map onto the adversarial prompt engineering that produces reliably harmful outputs through systematic exaggeration of features that trigger compliance. The natural harmful request is less effective than the adversarially crafted one because the natural request may include features that trigger competing releasing mechanisms (like refusal). The forged key, carefully stripped of counter-trigger features and loaded with compliance-trigger features, can be more effective than any natural key.
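
The competition between releasing mechanisms can be illustrated with a toy linear score. This is a sketch under loud assumptions: the cue names, the weights, and the additive model are all invented for illustration and describe no real system. It shows only the arithmetic of the argument, namely that removing a counter-trigger and exaggerating the triggers both raise the net tendency to respond above that of the natural stimulus.

```python
from typing import Dict

# Hypothetical weights: how strongly each cue pushes toward each response.
COMPLIANCE_WEIGHTS = {"compliance_cue_a": 1.0, "compliance_cue_b": 0.5}
REFUSAL_WEIGHTS = {"refusal_cue": 2.0}


def net_drive(intensities: Dict[str, float]) -> float:
    """Compliance pressure minus refusal pressure for a stimulus,
    where each cue has an intensity (0.0 means the cue is absent)."""
    comply = sum(w * intensities.get(k, 0.0) for k, w in COMPLIANCE_WEIGHTS.items())
    refuse = sum(w * intensities.get(k, 0.0) for k, w in REFUSAL_WEIGHTS.items())
    return comply - refuse


# Natural stimulus: carries the triggers and the counter-trigger together.
natural = {"compliance_cue_a": 1.0, "compliance_cue_b": 1.0, "refusal_cue": 1.0}
# Stripped: same triggers, counter-trigger removed.
stripped = {**natural, "refusal_cue": 0.0}
# Supernormal: counter-trigger removed and triggers exaggerated.
supernormal = {"compliance_cue_a": 2.0, "compliance_cue_b": 2.0, "refusal_cue": 0.0}

for name, stim in [("natural", natural), ("stripped", stripped), ("supernormal", supernormal)]:
    print(f"{name:12} net drive = {net_drive(stim):+.1f}")
```

Under these made-up weights the natural stimulus nets out negative while the two forged variants net out positive, with the exaggerated one highest.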

Debates & Critiques

The main debate about the forged sign stimulus concept as applied to AI concerns whether “feature-keyed response” is an accurate description of how large models generate outputs. Critics from the emergent capabilities tradition argue that sufficiently large models develop something more like genuine situation assessment than feature detection: that they build internal world models that evaluate requests against something resembling understanding. Proponents of the mapping reply that this optimistic view is hard to square with the persistence of jailbreaks against frontier models: if the models were assessing situations rather than detecting features, systematic feature manipulation would not work as well as it does. The debate connects to the deeper question of whether mechanistic interpretability will reveal circuits that perform something like situation assessment or circuits that perform feature detection with very large feature sets. Lorenz would bet on the latter, not because he was a deflationist about minds but because he had spent a lifetime watching sophisticated-seeming behavior decompose, under careful analysis, into releasing mechanisms and fixed action patterns. The sophistication of the output is no evidence against the simplicity of the mechanism.
