Pretraining teaches the model to continue arbitrary text. It does not teach the model to follow instructions, refuse harmful requests, produce structured reasoning, or behave consistently in conversation. These are taught in post-training. The base model — the output of pretraining alone — is in some ways more capable than the shipped assistant: it can do things the post-training has removed. In other ways it is far less useful: it cannot reliably follow requests, hold a conversation, or produce safe default behaviors. Post-training is the interface layer that turns raw capability into deployable product.
RLHF is the canonical post-training algorithm. In its standard form, human labelers rank model outputs; a reward model is trained on the rankings; the policy (the main model) is updated via reinforcement learning to produce outputs the reward model scores highly. The algorithm introduces preference signals that next-token prediction cannot by itself incorporate. It also introduces failure modes: reward-hacking, sycophancy, over-refusal, and the opacity of a second model (the reward model) whose own biases shape what the policy learns to be.
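The reward-model step of this pipeline can be sketched concretely. The snippet below fits a toy scalar reward to pairwise rankings with the Bradley-Terry loss, the standard pairwise objective in RLHF reward modeling; the linear reward, the feature vectors, and all names here are illustrative assumptions, not any production pipeline, which would use a neural reward head over sequence representations.

```python
import math
import random

def bradley_terry_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the chosen
    # output already outscores the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    # pairs: list of (features_chosen, features_rejected) vectors,
    # one pair per human ranking judgment.
    w = [0.0] * dim
    for _ in range(epochs):
        for fc, fr in pairs:
            rc = sum(wi * xi for wi, xi in zip(w, fc))
            rr = sum(wi * xi for wi, xi in zip(w, fr))
            # Gradient of the Bradley-Terry loss w.r.t. the margin rc - rr.
            g = -1.0 / (1.0 + math.exp(rc - rr))
            for i in range(dim):
                w[i] -= lr * g * (fc[i] - fr[i])
    return w

# Toy data: the first feature correlates with labeler preference,
# the second is noise.
random.seed(0)
pairs = []
for _ in range(50):
    chosen = [1.0 + random.random(), random.random()]
    rejected = [random.random(), random.random()]
    pairs.append((chosen, rejected))

w = train_reward_model(pairs, dim=2)
score = lambda f: sum(wi * xi for wi, xi in zip(w, f))
print(score([1.5, 0.5]) > score([0.2, 0.5]))  # True: chosen-like beats rejected-like
```

In the full pipeline, this learned scorer then replaces the human: the policy is updated (typically with PPO, plus a KL penalty against the pretrained model) to maximize the reward model's scores, which is exactly where reward-hacking enters, since the policy optimizes the proxy rather than the preferences themselves.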
Post-training is also where the economic bottleneck has moved. Pretraining is compute-bound; the engineering is well-understood, the costs are large but quantifiable. Post-training is human-bound: generating high-quality preference data requires skilled labelers, red-teamers, and domain experts at scale. The best frontier assistants differ from each other not mainly in their pretrained weights but in the quality of their post-training pipelines. This is why recent frontier releases have emphasized reasoning traces, structured feedback, and iterative self-improvement — post-training methods are becoming the frontier of capability gains.
The safety implications of post-training are load-bearing. Every safety property of a deployed model — refusal behavior, honesty, calibration under pressure, corrigibility — is learned in post-training. When safety training fails (as in documented cases of sleeper agents, alignment faking, and jailbreak vulnerabilities), the failure traces back to post-training. Anthropic's constitutional-AI line and OpenAI's deliberative-alignment work are attempts to make post-training itself more principled, more auditable, and more robust.
RLHF was introduced in Christiano et al.'s Deep Reinforcement Learning from Human Preferences (2017) — originally demonstrated on control tasks rather than language models — and scaled to instruction-following language models by Ouyang et al.'s Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022). DPO appeared in Rafailov et al.'s Direct Preference Optimization (2023). Constitutional AI was introduced by Bai et al. at Anthropic (2022). The field has moved rapidly since, with preference-tuning methods multiplying each quarter.
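DPO's key move is to skip the explicit reward model: the policy is trained directly on preference pairs, using its own log-probabilities (relative to a frozen reference model) as an implicit reward. A minimal sketch of the objective, assuming per-sequence log-probabilities are already computed (a real implementation would sum token log-probs from actual models; the variable names here are illustrative):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin): minimized by widening the gap between the
    # chosen and rejected log-ratios.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy raises the chosen sequence's likelihood
# (relative to the reference) more than the rejected sequence's.
before = dpo_loss(-12.0, -10.0, -12.0, -10.0)  # policy == reference
after = dpo_loss(-10.0, -11.0, -12.0, -10.0)   # policy now prefers chosen
print(after < before)  # True
```

The design choice is the appeal: one supervised-style loss replaces the reward-model-plus-RL loop, at the cost of losing the reward model as a separately auditable artifact.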
Capability is made usable in post-training. The base model's raw capability remains hard to access until post-training shapes it into reliable, safe behavior.
Preference signals introduce new failure modes. Sycophancy, reward-hacking, and over-refusal are characteristic post-training pathologies.
Post-training is the bottleneck of frontier differentiation. Pretraining produces commodity capability; post-training is the moat.
Safety lives in post-training. Every safety property a deployed model has is a post-training property.