RLHF and Post-Training — Orange Pill Wiki
CONCEPT

RLHF and Post-Training

The family of techniques — reinforcement learning from human feedback (RLHF), DPO, constitutional AI, and related methods — that shape a pretrained language model into a usable assistant. The stage where the model becomes the product.

Post-training is the collective name for the training stages applied after pretraining ends: supervised fine-tuning on curated conversations, reinforcement learning from human feedback (RLHF) to align outputs with human preferences, direct preference optimization (DPO) and its variants as computationally simpler alternatives, constitutional-AI methods that use model-generated feedback guided by a written principle set, and capability-specific fine-tuning for reasoning, tool use, and safety. Pretraining produces a token-distribution model; post-training produces a chatbot. The distinction between "what the model can do" and "what the model will do" lives entirely in post-training.
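The DPO objective mentioned above replaces the reward model and RL step with a single supervised loss on preference pairs. A minimal sketch of the per-example loss, with scalar log-probabilities standing in for sums over response tokens and β as the usual inverse-temperature hyperparameter (illustrative, not a production implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative per-example DPO loss.

    Inputs are log-probabilities of the chosen and rejected responses
    under the trainable policy (logp_*) and a frozen reference model
    (ref_logp_*), summed over response tokens.
    """
    # Implicit reward margin: beta-scaled difference of policy/reference
    # log-ratios between the chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimizing it pushes the
    # policy to prefer the chosen response relative to the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy has not yet moved from the reference, the margin is zero and the loss is log 2 ≈ 0.693; raising the chosen response's likelihood relative to the reference lowers the loss.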

In the AI Story

RLHF and post-training
Preference signals shape the assistant.

Pretraining teaches the model to continue arbitrary text. It does not teach the model to follow instructions, refuse harmful requests, produce structured reasoning, or behave consistently in conversation. These are taught in post-training. The base model — the output of pretraining alone — is in some ways more capable than the shipped assistant: it can do things the post-training has removed. In other ways it is far less useful: it cannot reliably follow requests, hold a conversation, or produce safe default behaviors. Post-training is the interface layer that turns raw capability into deployable product.

RLHF is the canonical post-training algorithm. In its standard form, human labelers rank model outputs; a reward model is trained on the rankings; the policy (the main model) is updated via reinforcement learning to produce outputs the reward model scores highly. The algorithm introduces preference signals that next-token prediction cannot by itself incorporate. It also introduces failure modes: reward-hacking, sycophancy, over-refusal, and the opacity of a second model (the reward model) whose own biases shape what the policy learns to be.
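The two learning signals in the loop above can be sketched in a few lines. The pairwise (Bradley-Terry) loss is the standard way to fit the reward model to rankings, and the KL-style penalty against the pretrained reference is the standard regularizer in InstructGPT-style RLHF; scalar scores here stand in for real model outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    """Pairwise Bradley-Terry loss for fitting the reward model:
    minimizing it drives the labeler-preferred output to score
    higher than the rejected one."""
    return -math.log(sigmoid(score_preferred - score_rejected))

def shaped_reward(reward, logp_policy, logp_ref, beta=0.02):
    """Reward the RL step actually optimizes: the reward-model score
    minus a KL-style penalty that keeps the policy from drifting far
    from the pretrained reference model."""
    return reward - beta * (logp_policy - logp_ref)
```

With equal scores the pairwise loss is log 2 ≈ 0.693 and shrinks as the preference margin grows; a policy that assigns its own samples much higher likelihood than the reference does pays a penalty proportional to β, which is one lever against reward-hacking.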

Post-training is also where the economic bottleneck has moved. Pretraining is compute-bound: the engineering is well understood, and the costs, though large, are quantifiable. Post-training is human-bound: generating high-quality preference data requires skilled labelers, red-teamers, and domain experts at scale. The best frontier assistants differ from each other not mainly in their pretrained weights but in the quality of their post-training pipelines. This is why recent frontier releases have emphasized reasoning traces, structured feedback, and iterative self-improvement — post-training methods are becoming the frontier of capability gains.

The safety implications of post-training are load-bearing. Every safety property of a deployed model — refusal behavior, honesty, calibration under pressure, corrigibility — is learned in post-training. When safety training fails (as in documented cases of sleeper agents, alignment faking, and jailbreak vulnerabilities), the failures are in post-training. Anthropic's constitutional-AI line and OpenAI's deliberative-alignment work are attempts to make post-training itself more principled, more auditable, and more robust.

Origin

RLHF was introduced in Christiano et al.'s Deep Reinforcement Learning from Human Preferences (2017), which learned reward functions from human preference comparisons in simulated control tasks, and was scaled to instruction-following language models by Ouyang et al.'s Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022). DPO appeared in Rafailov et al.'s Direct Preference Optimization (2023). Constitutional AI was introduced by Bai et al. at Anthropic (2022). The field has moved rapidly since, with preference-tuning methods multiplying each quarter.

Key Ideas

Capability is made usable in post-training. The base model's raw capability is hidden until post-training exposes it safely.

Preference signals introduce new failure modes. Sycophancy, reward-hacking, and over-refusal are characteristic post-training pathologies.

Post-training is the bottleneck of frontier differentiation. Pretraining produces commodity capability; post-training is the moat.

Safety lives in post-training. Every safety property a deployed model has is a post-training property.

Further reading

  1. Christiano, Paul et al. Deep Reinforcement Learning from Human Preferences (2017).
  2. Ouyang, Long et al. Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022).
  3. Rafailov, Rafael et al. Direct Preference Optimization (2023).
  4. Bai, Yuntao et al. Constitutional AI: Harmlessness from AI Feedback (2022).
  5. Casper, Stephen et al. Open Problems and Fundamental Limitations of RLHF (2023).
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.