Training Language Models to Follow Instructions With Human Feedback (2022)
July 24, 2025

Increasing the number of parameters does not inherently improve a model's ability to follow users' intent. Large Language Models (LLMs) are typically trained to predict the next token in large-scale internet text corpora, but this objective does not explicitly encourage models to follow user instructions in a helpful and safe manner. [Training Language Models to Follow Instructions with Human Feedback](https://arxiv.org/pdf/2203.02155) introduces a method for aligning models with human intent using Reinforcement Learning from Human Feedback (RLHF).
In human evaluations, outputs from InstructGPT, a 1.3 billion parameter GPT-3 model fine-tuned with RLHF, were preferred over outputs from the much larger 175 billion parameter GPT-3. As in Deep Reinforcement Learning from Human Preferences, the authors collected human preference data to train a reward model that scores model outputs by how desirable humans find them.
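For intuition, the reward model can be trained with a pairwise ranking loss over these comparisons: for each pair, the completion humans preferred should receive the higher scalar score. Below is a minimal PyTorch sketch; the function and variable names (`reward_ranking_loss`, `chosen`, `rejected`) are illustrative placeholders, not the paper's implementation.

```python
# Minimal sketch (PyTorch) of a pairwise ranking loss for training a reward
# model from human comparisons. Names here are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over comparison pairs:
    the preferred completion should receive the higher scalar reward."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: scalar rewards the reward model assigned to three
# (preferred, rejected) completion pairs.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.5, 1.0])
print(reward_ranking_loss(chosen, rejected))  # smaller when chosen > rejected
```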
The RLHF process involved three main steps. First, a team of 40 labelers provided demonstrations of ideal outputs for prompts sourced from the OpenAI API, which were used to fine-tune GPT-3 via supervised learning. Next, the labelers created a dataset of human preference rankings by comparing multiple model-generated outputs for the same prompts. This dataset was used to train a reward model that assigns scalar scores to outputs based on human preferences. Finally, the policy model was fine-tuned using Proximal Policy Optimization (PPO), optimizing its outputs to maximize the reward model’s evaluations.
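As a rough illustration of the final step, the quantity maximized during PPO combines the reward model's score with a per-token KL penalty that keeps the policy close to the supervised fine-tuned model. The sketch below is a simplified PyTorch version under assumed names (`rl_reward`, `kl_coef`); the paper's full objective includes additional terms.

```python
# Simplified sketch (PyTorch) of the reward maximized during PPO fine-tuning:
# the reward model's scalar score minus a KL penalty that discourages the
# policy from drifting too far from the supervised fine-tuned (SFT) model.
# Names and the kl_coef value are assumptions for illustration only.
import torch

def rl_reward(reward_model_score: torch.Tensor,   # shape: (batch,)
              policy_logprobs: torch.Tensor,      # shape: (batch, seq_len)
              sft_logprobs: torch.Tensor,         # shape: (batch, seq_len)
              kl_coef: float = 0.02) -> torch.Tensor:
    # Per-sequence KL estimate: sum over tokens of log pi_RL - log pi_SFT.
    kl_penalty = kl_coef * (policy_logprobs - sft_logprobs).sum(dim=-1)
    return reward_model_score - kl_penalty
```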