Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about—summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
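The method summarized above has three stages: collect human comparisons between summaries, train a reward model to predict which summary a labeler would prefer, and fine-tune the summarization policy with reinforcement learning against that reward model. The sketch below lays out that loop in Python; every function name in it is a hypothetical placeholder standing in for a component of the method, not the authors' implementation.

```python
# Hedged sketch of the human-feedback pipeline described in the abstract.
# All function names are hypothetical placeholders, not the authors' code.

def collect_comparisons(policy, posts):
    """Sample summary pairs per post and record which one labelers prefer."""
    raise NotImplementedError  # the human labeling step happens outside the code

def train_reward_model(comparisons, supervised_baseline):
    """Fit a model that scores summaries so the preferred one scores higher."""
    raise NotImplementedError

def rl_finetune(policy, reward_model, posts):
    """Optimize the summarization policy against the learned reward with RL."""
    raise NotImplementedError

def human_feedback_pipeline(supervised_baseline, posts, rounds=1):
    """Gather preferences -> fit reward model -> RL fine-tune, possibly repeated."""
    policy = supervised_baseline
    for _ in range(rounds):
        comparisons = collect_comparisons(policy, posts)
        reward_model = train_reward_model(comparisons, supervised_baseline)
        policy = rl_finetune(policy, reward_model, posts)
    return policy
```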
Large-scale language model pretraining has become increasingly prevalent for achieving high performance on a variety of ...
Most directly related to our work is previous work using human feedback to train summarization models with RL. Böhm et a...
Our approach is similar to the one outlined in [73], adapted to the batch setting. We start with an initial policy that ...
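During the RL stage, the reward being maximized combines the reward model's score with a penalty on divergence from the supervised baseline, of the form R(x, y) = r_θ(x, y) − β log[π^RL(y|x) / π^SFT(y|x)]. Below is a minimal sketch of that per-summary reward, assuming per-token log-probabilities of each sampled summary under both policies are already available; the tensor names, the PyTorch framing, and the default β value are assumptions, not the authors' code.

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        logprobs_rl: torch.Tensor,
                        logprobs_sft: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Reward used during RL: reward-model score minus a KL penalty.

    rm_score:     (batch,) reward-model score for each sampled summary
    logprobs_rl:  (batch, seq) per-token log-probs under the RL policy
    logprobs_sft: (batch, seq) per-token log-probs under the supervised baseline
    beta:         KL penalty coefficient (the default here is an assumption)
    Padding positions are assumed to already be zeroed out in both log-prob tensors.
    """
    # log pi_RL(y|x) - log pi_SFT(y|x), summed over the summary tokens
    log_ratio = (logprobs_rl - logprobs_sft).sum(dim=-1)
    return rm_score - beta * log_ratio
```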
Datasets. We use the TL;DR summarization dataset, which contains ~3 million posts from reddit.com across a variety of to...
Previous work on fine-tuning language models from human feedback reported 'a mismatch between the notion of quality we w...
All of our models are Transformer decoders in the style of GPT-3. We conduct our human feedback experiments on models wi...
Reward models. To train our reward models, we start from a supervised baseline, as described above, then add a randomly ...
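The reward model is trained on pairs of summaries for the same post, using a logistic loss on the difference of its scalar scores so that the human-preferred summary receives the higher score. A minimal PyTorch sketch of that comparison loss, assuming the two scores have already been computed (the function and argument names are assumptions):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss: -log sigmoid(r(x, y_preferred) - r(x, y_rejected)).

    Both inputs are (batch,) scalar scores from the reward model's output head.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```

After training, the paper normalizes the reward model's outputs so that reference summaries from the dataset receive a mean score of zero.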
Policies trained with human feedback are preferred to much larger supervised policies. Our main results evaluating our h...
Our human feedback models can also generate excellent summaries of CNN/DM news articles without any further training (Fi...
What happens as we optimize the reward model? Optimizing against our reward model is supposed to make our policy align w...
Evaluation. We study how well various automatic metrics act as predictors for human preferences, and compare them to our...
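A simple way to compare an automatic metric against the learned reward models as a predictor of human preference is agreement: on held-out comparisons, how often the metric assigns the higher score to the summary labelers preferred. A minimal sketch of that computation follows; the input format is hypothetical, and counting ties as half credit is a choice made for this sketch rather than the paper's stated rule.

```python
def agreement_with_labelers(comparisons):
    """Fraction of comparisons where the metric scores the human-preferred summary higher.

    `comparisons` is an iterable of (metric_score_preferred, metric_score_other)
    pairs -- a hypothetical format chosen for this sketch. Ties count as half credit.
    """
    total, agree = 0, 0.0
    for preferred, other in comparisons:
        total += 1
        if preferred > other:
            agree += 1.0
        elif preferred == other:
            agree += 0.5
    return agree / total if total else float("nan")
```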
Limitations. One limitation of our work is the time and cost required to produce our final models. Notably, fine-tuning ...
Here, we discuss the pre-processing steps that we apply to the TL;DR dataset. We first remove all duplicate posts by che...
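Only the deduplication step is visible above, so the sketch below covers just that step: dropping posts whose body text has been seen before. The field name and the exact-match criterion are assumptions made for illustration, not the paper's stated matching rule.

```python
def deduplicate_posts(posts):
    """Drop posts whose body text has appeared before (first occurrence kept).

    `posts` is an iterable of dicts with a "body" field; the field name and the
    exact-match criterion are assumptions made for this sketch.
    """
    seen = set()
    unique = []
    for post in posts:
        body = post["body"].strip()
        if body in seen:
            continue
        seen.add(body)
        unique.append(post)
    return unique
```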
B.1 Hyperparameters. All models follow the standard Transformer architecture, with 2048 learned position embeddings. All...
C.1 Process for ensuring high-quality human data. We first detail the procedures we use to ensure high-quality data. Whi...
In testing our human feedback techniques, we collected a large amount of high-quality data from human labelers. In order...
As discussed in Section 4.1, the length of a summary is a confounding factor for evaluating summary quality; depending o...
G.1 Value function ablation. In this section, we conduct an ablation comparing using separate parameters for the value f...
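The ablation contrasts two standard PPO parameterizations: a value function that shares its transformer trunk with the policy, and one with fully separate parameters. A minimal PyTorch sketch of the two configurations, with module and argument names chosen for this sketch and the transformer trunk left abstract:

```python
import copy
import torch
import torch.nn as nn

class PolicyWithValue(nn.Module):
    """Policy plus value function, with the value trunk either shared or separate."""

    def __init__(self, trunk: nn.Module, hidden_size: int, vocab_size: int,
                 separate_value_params: bool = True):
        super().__init__()
        self.policy_trunk = trunk
        # Ablation switch: copy the trunk (separate parameters) or reuse it (shared).
        self.value_trunk = copy.deepcopy(trunk) if separate_value_params else trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token logits
        self.value_head = nn.Linear(hidden_size, 1)         # scalar value per position

    def forward(self, inputs: torch.Tensor):
        # `inputs` stands in for whatever the trunk consumes (token ids, embeddings,
        # ...); the trunk is kept abstract for this sketch.
        policy_hidden = self.policy_trunk(inputs)
        value_hidden = self.value_trunk(inputs)
        logits = self.lm_head(policy_hidden)
        values = self.value_head(value_hidden).squeeze(-1)
        return logits, values
```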