The post-training technique that transformed GPT-3 into ChatGPT — and, as Harvard's Kempner Institute observed, a Skinner box operating on neural networks with human preference ratings as the reinforcing consequence.
Reinforcement learning from human feedback (RLHF) is the training procedure in which a pretrained language model's outputs are rated by human evaluators, a reward model is trained on those ratings, and the language model is subsequently fine-tuned through reinforcement learning to maximize predicted reward. The technique was the decisive innovation that converted impressive but unwieldy models like GPT-3 into usable conversational assistants. Structurally, RLHF implements Skinner's operant paradigm with unusual directness: an organism (the neural network) emits responses, a reinforcing consequence (the reward signal derived from human ratings) is delivered contingent on the response, and the organism's future response probabilities are modified accordingly. The Skinner volume uses this structural identity as its analytical lever: systems trained by operant principles implement operant contingencies on their users.
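The three stages named above (human ratings, a reward model fitted to them, and reinforcement of the policy toward predicted reward) can be sketched in a deliberately tiny form. Everything here is an illustrative assumption rather than any lab's actual pipeline: the "language model" is a softmax over three canned responses, the hypothetical `comparisons` list stands in for human preference data, the reward model is a lookup table fitted with a Bradley-Terry style logistic loss, and the fine-tuning step is plain REINFORCE with a baseline. Production systems typically use neural reward models and algorithms such as PPO, which this sketch does not attempt to reproduce.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into response probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reward_model(comparisons, num_responses, epochs=200, lr=0.1):
    """Fit one scalar reward per response from (winner, loser) pairs.

    Assumption: a Bradley-Terry model, P(winner preferred) = sigmoid(r_w - r_l).
    """
    rewards = [0.0] * num_responses
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(rewards[loser] - rewards[winner]))
            # Gradient ascent on log P(winner preferred).
            rewards[winner] += lr * (1.0 - p)
            rewards[loser] -= lr * (1.0 - p)
    return rewards


def fine_tune_policy(logits, rewards, steps=500, lr=0.1, seed=0):
    """REINFORCE: emit a response, then shift its probability by its advantage."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # The policy "emits a response".
        action = rng.choices(range(len(logits)), weights=probs)[0]
        # Baseline: expected predicted reward under the current policy.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[action] - baseline
        # Gradient of log pi(action) with respect to each logit.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return logits


# Hypothetical preference data: raters usually favor response 2.
comparisons = [(2, 0), (2, 1), (1, 0), (2, 0), (0, 1), (2, 1)]

rewards = train_reward_model(comparisons, num_responses=3)
policy_logits = fine_tune_policy([0.0, 0.0, 0.0], rewards)
probs = softmax(policy_logits)
print("predicted rewards:", [round(r, 2) for r in rewards])
print("response probabilities:", [round(p, 3) for p in probs])
```

In this toy setting the operant reading is visible in `fine_tune_policy`: the response is emitted first, the consequence (the advantage) arrives afterward, and only then are future response probabilities adjusted.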
Reinforcement Learning from Human Feedback
In The You On AI Field Guide
RLHF emerged from research at OpenAI, DeepMind, and Anthropic in the late 2010s, building on earlier work in preference-based reinforcement learning and inverse reinforcement learning. The breakthrough