CONCEPT
RLHF and Post-Training
The family of techniques — reinforcement learning from human feedback (RLHF), DPO, constitutional AI, and related methods — that shape a pretrained
language model into a usable assistant. The stage where the model becomes the product.
Post-training is the collective name for the training stages applied after pretraining ends: supervised fine-tuning on curated conversations, reinforcement learning from human feedback (RLHF) to align outputs with human preferences, direct preference optimization (DPO) and its variants as computationally simpler alternatives, constitutional-AI methods that use model-generated feedback guided by a written principle set, and capability-specific fine-tuning for reasoning, tool use, and safety. Pretraining produces a token-distribution model; post-training produces a chatbot. The distinction between "what the model can do" and "what the model will do" lives entirely in post-training.
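To make "computationally simpler" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the summed log-probabilities of the preferred and rejected responses have already been computed under both the model being trained and a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response.
    All names and the default beta are illustrative assumptions.
    """
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy favors the chosen response more than
    # the reference model does.
    logits = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

In this sketch the loss falls as the policy raises the preferred response relative to the rejected one, measured against the reference model. No separate reward model or sampling loop appears, which is the sense in which DPO is simpler than RLHF.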
In The You On AI Field Guide
Pretraining teaches the model to continue arbitrary text. It does not teach the model to follow instructions, refuse harmful requests, produce structured reasoning, or behave consistently in conversation. These behaviors are taught in post-training. The base model, the output of pretraining alone, is in some ways more capable than the shipped assistant: it retains behaviors that post-training suppresses.