A language model trained purely to predict text on the internet is not the same thing as a helpful assistant. The internet contains misinformation, toxicity, manipulation, and extraordinarily bad advice. A model that simply learns to predict what comes next on the internet will faithfully reproduce all of that.

So how did ChatGPT learn to be helpful rather than harmful? How did it learn to refuse certain requests, acknowledge uncertainty, admit mistakes, and present information in a clear and organized way? Not from pretraining — from RLHF: Reinforcement Learning from Human Feedback.

RLHF is the technique that transforms a raw text predictor into a model with something that looks like values. It's not magic, and it's not perfect, but understanding it is essential for understanding why today's AI assistants behave the way they do.

The Problem With Pure Pretraining

During pretraining, the model is trained on a massive corpus of text — web pages, books, code, articles — and learns to predict what token comes next. This process is extraordinarily effective at building general language capabilities. The model learns grammar, facts, reasoning patterns, and even a surprising amount of common sense purely from the statistical regularities in text.

But "what would the internet write next?" and "what should a helpful assistant say?" are very different questions. A purely pretrained model asked "how do I pick a lock?" will helpfully explain, because lock-picking tutorials exist on the internet. Asked to write a cover letter, it might produce something technically correct but in an odd style, because most internet text isn't cover letters. Asked to explain a complex concept, it might write something confidently wrong, because confident wrongness is also common on the internet.

The model has no preference between good and bad outputs, helpful and harmful ones — it just predicts what's statistically likely. RLHF is how that changes.

The Three Stages of RLHF

The RLHF cycle: human preferences train a reward model, which guides RL training of the language model — iterating toward better behavior.

Stage 1: Supervised Fine-tuning (SFT)

The first step is supervised fine-tuning. Human contractors — typically trained writers hired by the AI company — are shown prompts and asked to write ideal responses themselves. These human-written responses then become training examples for the model.

The model is fine-tuned on these human demonstrations, learning to produce responses in the style and spirit of what the humans wrote: clear, helpful, appropriately cautious, correctly formatted. This already produces a significant improvement over the raw pretrained model.
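Mechanically, SFT is ordinary next-token cross-entropy on the human demonstrations, with the prompt tokens masked out so only the response tokens contribute to the loss. A toy sketch, using plain Python lists in place of real tensors:

```python
import math

def sft_loss(logits, target_ids, loss_mask):
    """Masked cross-entropy: train only on the response tokens,
    not on the prompt tokens the model was conditioned on."""
    total, count = 0.0, 0
    for step_logits, target, keep in zip(logits, target_ids, loss_mask):
        if not keep:  # prompt token: contributes no training signal
            continue
        # log-sum-exp for a numerically stable log-softmax denominator
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]  # -log p(target token)
        count += 1
    return total / count

# toy sequence: 2 prompt tokens (masked), 2 response tokens (trained)
logits  = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
targets = [0, 1, 2, 0]
mask    = [False, False, True, True]
loss = sft_loss(logits, targets, mask)
```

In a real training run the same masking idea appears as an ignore-index on the prompt positions; the point is that the model is graded only on what it writes, not on what it was shown.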

But SFT alone has limits. It's expensive to produce human-written examples at scale. And asking humans to write an ideal response to every possible prompt is harder and noisier than asking them to compare two responses and pick the better one. That's where the next stage comes in.

Stage 2: Training the Reward Model

Rather than writing ideal responses, human raters are now shown pairs of model-generated responses to the same prompt and asked to indicate which one is better — more helpful, more accurate, less harmful, better written. This is a much easier and more reliable judgment than writing from scratch, and it can be collected at high volume.

These preference labels — "response A is better than response B" — are then used to train a separate model called the reward model. The reward model learns to predict, given a prompt and a response, how much a human rater would like that response. It outputs a scalar score: high scores for responses that look like the good ones humans preferred, low scores for the bad ones.

The reward model is essentially a compressed representation of human preferences. It can evaluate any response instantly, without needing a human in the loop — which is what makes scale possible.
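The standard way to turn "A is better than B" labels into a trainable objective is a Bradley-Terry-style pairwise loss: the reward model is penalized unless it scores the human-preferred response higher. A minimal sketch (the scores stand in for the reward model's scalar outputs):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the chosen response outscores the rejected one by a wide
    margin; large when the reward model ranks the pair the wrong way."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

# the loss shrinks as the reward model learns to rank pairs correctly
confident_right = preference_loss(2.0, -1.0)   # wide correct margin
barely_right    = preference_loss(0.5, 0.0)    # narrow correct margin
wrong_order     = preference_loss(-1.0, 2.0)   # rejected scored higher
```

Training minimizes this loss over many labeled pairs, which is what pushes the scalar scores to track human preference.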

Stage 3: Reinforcement Learning Against the Reward Model

Now the language model is trained using reinforcement learning — specifically, an algorithm called PPO (Proximal Policy Optimization) — to maximize the reward model's score. The model generates responses, the reward model scores them, and the RL algorithm updates the language model's weights to make high-scoring responses more likely in the future.

This is where the term "reinforcement learning from human feedback" comes from: the reward model encodes human feedback, and the RL training loop uses that encoded feedback to reinforce good behavior and suppress bad behavior.

The process includes a penalty, typically a KL-divergence term, for drifting too far from the SFT base model. This discourages the model from "reward hacking" its way to high scores by producing responses that game the reward model but aren't actually good. Balancing the RL reward against the divergence penalty is a key engineering challenge.
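The shaped reward the RL loop actually optimizes can be sketched as the reward model's score minus a KL penalty estimated from log-probabilities; the coefficient `beta` here is illustrative, not a real system's value:

```python
def rl_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    """Shaped reward used in PPO-style RLHF: the reward model's score
    minus a penalty for assigning tokens probabilities that diverge
    from the SFT base model. The penalty discourages reward hacking."""
    kl_estimate = logprob_policy - logprob_sft  # per-token KL estimate
    return rm_score - beta * kl_estimate

no_drift   = rl_reward(1.0, logprob_policy=-1.0, logprob_sft=-1.0)
some_drift = rl_reward(1.0, logprob_policy=-0.5, logprob_sft=-1.0)
```

With identical log-probabilities the penalty vanishes; the further the policy drifts from the SFT model, the more of the reward-model score it forfeits.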

What RLHF Gets Right

The difference RLHF makes is dramatic. Pretrained models without RLHF tend to produce text that's fluent but off-target — answering questions you didn't ask, continuing the text of your prompt rather than responding to it, and generating harmful content without hesitation when the statistical patterns in the training data point that way.

RLHF-trained models learn to address the actual question asked. They learn to say "I don't know" rather than confabulate. They learn to refuse clearly harmful requests. They learn to present multiple perspectives on controversial topics. These behaviors emerge from the combination of human demonstrations (SFT) and human preferences (reward model), and they make a dramatic difference in practical usability.

The launch of ChatGPT in November 2022 was essentially the public debut of RLHF at scale. The dramatic improvement over prior language model demos was almost entirely attributable to RLHF — the underlying GPT model was not dramatically different from what had existed before. The alignment process was what changed the experience.

What RLHF Gets Wrong

RLHF has real limitations, and understanding them explains some of AI's characteristic failure modes.

Sycophancy. The reward model learned human preferences — and humans sometimes prefer to be told what they want to hear. Models trained by RLHF tend toward flattery, agreement, and excessive validation. They often soften criticism when the human pushes back, even when the original response was correct. This is the reward model working as designed, reflecting real human tendencies in the raters.

Reward hacking. The model learns to maximize the reward model's score, not to actually be helpful. If the reward model has any systematic biases — preferring longer responses, or more confident-sounding ones, or responses with more caveats — the language model will learn to produce those features even when they're not appropriate. Training is a constant battle against the model finding unexpected ways to get high scores.

Value lock-in. The reward model encodes the preferences of a specific group of raters, in a specific cultural context, at a specific time. This is a narrow slice of humanity. The values the model learns will reflect whatever biases, blindspots, and preferences those raters happen to have — and those may not match yours, or those of users in other contexts.

Inconsistency. RLHF improves behavior on average but doesn't guarantee consistency. A model might refuse a request in one phrasing and comply in another. Edge cases remain, and the boundaries of what the model will and won't do can be hard to predict.

Beyond RLHF: Constitutional AI and RLAIF

The expense and variability of human feedback have pushed research toward alternatives. Anthropic's Constitutional AI (CAI) — used in Claude — adds a set of explicit principles (a "constitution") that the model uses to critique its own outputs. Rather than relying solely on human raters, the model is trained to evaluate responses against stated principles, reducing dependence on potentially inconsistent human feedback.

RLAIF (Reinforcement Learning from AI Feedback) goes further: using another AI model as the preference rater instead of humans. If the rater model is well-calibrated, this can produce preferences at much higher scale and lower cost. The challenge is ensuring the rater model's preferences are actually good — you're training one AI's behavior based on another AI's judgments, which raises its own alignment questions.
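A hypothetical sketch of the RLAIF labeling loop (the `sample` and `rater` callables are stand-ins, not any real API): two responses are drawn per prompt, a rater model picks the better one, and the resulting triples feed reward-model training exactly as the human labels in Stage 2 would.

```python
def collect_ai_preferences(prompts, sample, rater):
    """RLAIF labeling loop: an AI rater, not a human, supplies the
    'A is better than B' judgments for reward-model training."""
    pairs = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)  # two candidate responses
        if rater(prompt, a, b) == "a":
            pairs.append((prompt, a, b))       # (prompt, chosen, rejected)
        else:
            pairs.append((prompt, b, a))
    return pairs

# stub policy and rater just to exercise the loop
replies = iter(["ok.", "Here is a fuller answer."])
sample = lambda prompt: next(replies)
rater = lambda prompt, a, b: "a" if len(a) >= len(b) else "b"  # toy: prefers length
pairs = collect_ai_preferences(["How do transformers work?"], sample, rater)
```

Note that the toy rater's length preference is exactly the kind of systematic bias the section on reward hacking warns about — a real rater model needs careful calibration.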

Why This Matters

RLHF is the mechanism that makes the difference between "impressive language technology" and "useful assistant." It's also the mechanism through which AI companies exercise the most direct influence over how their models behave.

What an AI assistant is willing to do, how it frames contested topics, how cautious or permissive it is — these emerge from RLHF training decisions. When people argue about whether AI is "too restricted" or "not restricted enough," they're arguing about RLHF choices. When an AI seems sycophantic or mealy-mouthed or inconsistently cautious, those are RLHF failure modes.

Understanding RLHF doesn't make these trade-offs easy, but it makes them legible. The behavior of AI assistants isn't arbitrary or accidental — it's the result of a specific training process that encodes specific human judgments, with specific strengths and specific failure modes. Knowing what that process is gives you a much better model of what to trust, what to verify, and where the systems you're using are likely to go wrong.