RLHF and Post-Training — Orange Pill Wiki
CONCEPT

RLHF and Post-Training

The family of techniques — reinforcement learning from human feedback (RLHF), DPO, constitutional AI, and related methods — that shape a pretrained language model into a usable assistant. The stage where the model becomes the product.

Post-training is the collective name for the training stages applied after pretraining ends: supervised fine-tuning on curated conversations, reinforcement learning from human feedback (RLHF) to align outputs with human preferences, direct preference optimization (DPO) and its variants as computationally simpler alternatives, constitutional-AI methods that use model-generated feedback guided by a written principle set, and capability-specific fine-tuning for reasoning, tool use, and safety. Pretraining produces a token-distribution model; post-training produces a chatbot. The distinction between "what the model can do" and "what the model will do" lives entirely in post-training.
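The DPO objective mentioned above replaces the reward model and RL step with a single supervised loss on preference pairs. A minimal sketch of the per-example loss, with scalar log-probabilities standing in for sums over response tokens and β as the usual inverse-temperature hyperparameter (illustrative, not a production implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative per-example DPO loss.

    Inputs are log-probabilities of the chosen and rejected responses
    under the trainable policy (logp_*) and a frozen reference model
    (ref_logp_*), summed over response tokens.
    """
    # Implicit reward margin: beta-scaled difference of policy/reference
    # log-ratios between the chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimizing it pushes the
    # policy to prefer the chosen response relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy has not yet moved from the reference, the margin is zero and the loss is log 2 ≈ 0.693; raising the chosen response's likelihood relative to the reference lowers the loss.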

In the AI Story

RLHF and post-training
Preference signals shape the assistant.

Pretraining teaches the model to continue arbitrary text. It does not teach the model to follow instructions, refuse harmful requests, produce structured reasoning, or behave consistently in conversation. These are taught in post-training. The base model — the output of pretraining alone — is in some ways more capable than the shipped assistant: it can do things the post-training has removed. In other ways it is far less useful: it cannot reliably follow requests, hold a conversation, or produce safe default behaviors. Post-training is the interface layer that turns raw capability into deployable product.

RLHF is the canonical post-training algorithm. In its standard form, human labelers rank model outputs; a reward model is trained on the rankings; the policy (the main model) is updated via reinforcement learning to produce outputs the reward model scores highly. The algorithm introduces preference signals that next-token prediction cannot by itself incorporate. It also introduces failure modes: reward-hacking, sycophancy, over-refusal, and the opacity of a second model (the reward model) whose own biases shape what the policy learns to be.
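The two learning signals in the loop above can be sketched in a few lines. The pairwise (Bradley-Terry) loss is the standard way to fit the reward model to rankings, and the KL-style penalty against the pretrained reference is the standard regularizer in InstructGPT-style RLHF; scalar scores here stand in for real model outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    """Pairwise Bradley-Terry loss for fitting the reward model:
    minimizing it drives the labeler-preferred output to score
    higher than the rejected one."""
    return -math.log(sigmoid(score_preferred - score_rejected))

def shaped_reward(reward, logp_policy, logp_ref, beta=0.02):
    """Reward the RL step actually optimizes: the reward-model score
    minus a KL-style penalty that keeps the policy from drifting far
    from the pretrained reference model."""
    return reward - beta * (logp_policy - logp_ref)
```

With equal scores the pairwise loss is log 2 ≈ 0.693 and shrinks as the preference margin grows; a policy that assigns its own samples much higher likelihood than the reference does pays a penalty proportional to β, which is one lever against reward-hacking.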

Post-training is also where the economic bottleneck has moved. Pretraining is compute-bound: the engineering is well understood, and the costs, though large, are quantifiable. Post-training is human-bound: generating high-quality preference data requires skilled labelers, red-teamers, and domain experts at scale. The best frontier assistants differ from each other not mainly in their pretrained weights but in the quality of their post-training pipelines. This is why recent frontier releases have emphasized reasoning traces, structured feedback, and iterative self-improvement — post-training methods are becoming the frontier of capability gains.

The safety implications of post-training are load-bearing. Every safety property of a deployed model — refusal behavior, honesty, calibration under pressure, corrigibility — is learned in post-training. When safety training fails (as in documented cases of sleeper agents, alignment faking, and jailbreak vulnerabilities), the failures are in post-training. Anthropic's constitutional-AI line and OpenAI's deliberative-alignment work are attempts to make post-training itself more principled, more auditable, and more robust.

Origin

RLHF was introduced in Christiano et al.'s Deep Reinforcement Learning from Human Preferences (2017), which learned reward functions from human preference comparisons in simulated control tasks, and was scaled to instruction-following language models by Ouyang et al.'s Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022). DPO appeared in Rafailov et al.'s Direct Preference Optimization (2023). Constitutional AI was introduced by Bai et al. at Anthropic (2022). The field has moved rapidly since, with preference-tuning methods multiplying each quarter.

Key Ideas

Capability is made usable in post-training. The base model's raw capability is hidden until post-training exposes it safely.

Preference signals introduce new failure modes. Sycophancy, reward-hacking, and over-refusal are characteristic post-training pathologies.

Post-training is the bottleneck of frontier differentiation. Pretraining produces commodity capability; post-training is the moat.

Safety lives in post-training. Every safety property a deployed model has is a post-training property.

Further reading

  1. Christiano, Paul et al. Deep Reinforcement Learning from Human Preferences (2017).
  2. Ouyang, Long et al. Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022).
  3. Rafailov, Rafael et al. Direct Preference Optimization (2023).
  4. Bai, Yuntao et al. Constitutional AI: Harmlessness from AI Feedback (2022).
  5. Casper, Stephen et al. Open Problems and Fundamental Limitations of RLHF (2023).
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.