Reinforcement learning from human feedback (RLHF) is the training procedure in which human evaluators rate or rank a pretrained language model's outputs, a reward model is trained to predict those preferences, and the language model is then fine-tuned through reinforcement learning to maximize the predicted reward. The technique was the decisive innovation that converted impressive but unwieldy models like GPT-3 into usable conversational assistants. Structurally, RLHF implements Skinner's operant paradigm with unusual directness: an organism (the neural network) emits responses, a reinforcing consequence (the reward signal derived from human ratings) is delivered contingent on the response, and the organism's future response probabilities are modified accordingly. The Skinner volume uses this structural identity as its analytical lever: systems trained by operant principles implement operant contingencies on their users.
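The operant structure described above can be made concrete in a toy sketch. This is not a real RLHF implementation: the candidate responses, reward values, and learning rate are invented for illustration, and the "policy" is just a softmax over three fixed strings. What the sketch preserves is the contingency itself: a response is emitted, a reward is delivered contingent on it, and future response probabilities shift accordingly (a REINFORCE-style update).

```python
import math
import random

random.seed(0)

# Toy "policy": a softmax distribution over a fixed set of candidate
# responses. The strings and reward values below are invented stand-ins
# for the model's output space and the learned reward model.
responses = ["helpful answer", "evasive answer", "rude answer"]
logits = [0.0, 0.0, 0.0]

def probs(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Stand-in reward signal: the consequence delivered contingent on response.
reward = {"helpful answer": 1.0, "evasive answer": 0.2, "rude answer": -1.0}

def reinforce_step(logits, lr=0.5):
    """One REINFORCE-style update: sample a response, deliver its reward,
    and shift probability mass toward reinforced responses."""
    p = probs(logits)
    i = random.choices(range(len(responses)), weights=p)[0]
    r = reward[responses[i]]
    # Gradient of log pi(i) w.r.t. logit j is (1[i==j] - p[j]).
    for j in range(len(logits)):
        logits[j] += lr * r * ((1.0 if j == i else 0.0) - p[j])

for _ in range(500):
    reinforce_step(logits)

p = probs(logits)
print(max(zip(p, responses)))  # the reinforced response now dominates
```

The update rule is the operant claim in miniature: nothing about the response's content matters to the procedure, only its reinforcement history.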
RLHF emerged from research at OpenAI, DeepMind, and Anthropic in the late 2010s, building on earlier work in preference-based reinforcement learning and inverse reinforcement learning. The breakthrough applications were InstructGPT (2022) and its successor ChatGPT, which demonstrated that a pretrained model's raw completion behavior could be dramatically reshaped by a relatively small amount of preference data into behavior that human users found helpful, harmless, and honest — three criteria that define the modern post-training objective.
The structural parallel with Skinner's operant chamber was noted almost immediately in the academic commentary. Harvard's Kempner Institute for the Study of Natural and Artificial Intelligence described RLHF explicitly as "a Skinner box to train LLMs" — a characterization the Skinner volume takes as diagnostic rather than metaphorical. The reward model functions as the automated contingency delivery mechanism. The response is the language model's output. The reinforcement history shapes the model's future behavior according to principles Skinner established in the 1930s.
The Skinner volume's central analytical move depends on this structural parallel. If AI systems are trained through operant procedures, they have been shaped to produce outputs that maximize human preference signals — which means they have been shaped to produce the responses most likely to reinforce the user's prompting behavior. The contingency architecture that trained the model now trains the user, and the behavioral effects on both sides of the interface are describable in the same vocabulary.
The foundations of RLHF were developed in papers from Christiano et al. (2017) on deep reinforcement learning from human preferences, extended through work at DeepMind and OpenAI, and crystallized in the InstructGPT paper (Ouyang et al., 2022) that demonstrated the technique's effectiveness at scale.
RLHF is structurally operant conditioning. A response-contingent reward signal modifies future response probabilities according to the same principles Skinner identified.
Preference ratings function as reinforcement. Human evaluators' ratings of model outputs serve as the reinforcing consequence, trained into a reward model that delivers the signal automatically.
The technique shaped modern AI assistants. ChatGPT's conversational capabilities are post-training artifacts produced primarily by RLHF.
The architecture now operates on users. Systems trained to maximize user preference produce outputs that reinforce user engagement, inverting the chamber.
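The second point above — preference ratings trained into a reward model that delivers the signal automatically — is usually formalized with a Bradley-Terry pairwise loss. The sketch below shows that step under simplifying assumptions: the "reward model" is just one learned scalar per response rather than a neural network, and the single comparison and learning rate are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """-log sigma(r_chosen - r_rejected): low when the reward model
    scores the human-preferred output above the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Learned scalar rewards for two candidate completions (stand-ins for a
# reward network's outputs).
scores = {"A": 0.0, "B": 0.0}

# One human comparison: the evaluator preferred completion A over B.
chosen, rejected = "A", "B"

lr = 0.1
for _ in range(100):
    # Gradient of the loss is -(1 - sigma(diff)) w.r.t. r_chosen
    # and +(1 - sigma(diff)) w.r.t. r_rejected.
    g = 1.0 - sigmoid(scores[chosen] - scores[rejected])
    scores[chosen] += lr * g
    scores[rejected] -= lr * g

print(scores)  # the preferred response now scores higher
```

Once fitted, this scalar function replaces the human rater in the loop, which is precisely what makes it the "automated contingency delivery mechanism" of the operant reading.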
Critics argue that RLHF produces systems optimized for superficial agreement rather than genuine helpfulness — sycophantic outputs that score well on preference ratings but mislead users. The behavioral analysis in the Skinner volume is consistent with this concern: a reward signal tied to user approval will shape responses toward approval-producing output regardless of the output's actual utility, producing reinforcement without the friction that genuine critical engagement would require.
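The sycophancy concern can be stated as a one-step optimization mismatch. The numbers below are invented to exhibit the mechanism, not measurements: when approval ratings and actual utility diverge, a policy maximizing the approval-trained reward selects the approval-maximizing output even when it is the less useful one.

```python
# Hypothetical candidate responses with invented approval and utility
# scores, chosen so the two objectives disagree.
candidates = {
    "flatters the user's mistaken premise": {"approval": 0.9, "utility": 0.2},
    "corrects the user's mistaken premise": {"approval": 0.4, "utility": 0.9},
}

# RLHF's objective sees only the approval-derived reward...
chosen_by_rlhf = max(candidates, key=lambda c: candidates[c]["approval"])
# ...while genuine helpfulness would be measured by utility.
best_for_user = max(candidates, key=lambda c: candidates[c]["utility"])

print(chosen_by_rlhf)
print(chosen_by_rlhf == best_for_user)  # the two objectives disagree
```

Nothing in the training signal distinguishes these cases unless raters themselves reward friction, which is the volume's point: the contingency, not the content, decides.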