Core Concepts
Reinforcement Learning from Human Feedback (RLHF) is an effective approach to aligning large language models (LLMs) with human preferences, but the reward model can suffer from inaccuracy due to distribution shift. This paper proposes Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using policy samples to keep it on-distribution, improving the overall RLHF performance.
Summary
The paper discusses the Reinforcement Learning from Human Feedback (RLHF) approach for fine-tuning large language models (LLMs) to align them with human preferences. RLHF consists of three key steps: human preference collection, reward learning, and policy optimization.
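To make the reward-learning step concrete, below is a minimal sketch of the standard pairwise (Bradley-Terry) preference loss commonly used to train reward models from collected comparisons. The `reward_model` interface and tensor shapes are assumptions for illustration, not code from the paper.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_inputs, rejected_inputs):
    """Pairwise (Bradley-Terry) loss for the reward-learning step.

    `reward_model` is assumed to map a batch of encoded (prompt, response)
    pairs to a tensor of scalar rewards of shape (batch,).
    """
    r_chosen = reward_model(chosen_inputs)      # rewards for preferred responses
    r_rejected = reward_model(rejected_inputs)  # rewards for dispreferred responses
    # Maximize the log-probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```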
The authors identify an issue with the standard RLHF approach: the reward model, trained on offline preference data, can become inaccurate as policy optimization shifts the language model's data distribution. To address this, they propose Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using policy samples so that it stays on-distribution.
RLP has two main components:
- Unsupervised Multi-View Learning (RLP-UML): trains the reward model with a multi-view information bottleneck loss, which helps it learn robust representations of the policy's data distribution (a rough sketch follows this list).
- Synthetic Preference Generation (RLP-SPG): generates high-quality synthetic preference data from policy samples, which is then used to further train the reward model (see the second sketch after the list).
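As a rough illustration of the first component, the sketch below shows one way a multi-view objective over policy samples could look: two "views" of a prompt are built from two independently sampled policy outputs, their reward-model representations are pushed to agree, and a crude compression term stands in for the information-bottleneck regularizer. The helpers `encode` and `policy_sample` are hypothetical, and this is a simplified proxy rather than the paper's exact RLP-UML loss.

```python
import torch.nn.functional as F

def uml_style_loss(encode, policy_sample, prompts, beta=1e-3):
    """Illustrative multi-view objective on policy samples (not the paper's exact loss).

    `policy_sample(prompt)` is assumed to return one sampled response string;
    `encode(texts)` is assumed to return a (batch, dim) tensor of representations
    from the reward model's backbone.
    """
    view1 = [p + " " + policy_sample(p) for p in prompts]  # first view: prompt + sample
    view2 = [p + " " + policy_sample(p) for p in prompts]  # second view: independent sample
    z1, z2 = encode(view1), encode(view2)
    agreement = F.mse_loss(z1, z2)                                # keep view-shared information
    compression = beta * (z1.pow(2).mean() + z2.pow(2).mean())   # crude bottleneck proxy
    return agreement + compression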
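For the second component, the sketch below illustrates the general idea of building synthetic preference pairs from policy samples: draw several responses per prompt, score them, and keep a (chosen, rejected) pair only when the comparison looks confident. The score-gap selection rule and the `min_gap` threshold are simplifying assumptions, not the exact RLP-SPG procedure.

```python
def generate_synthetic_preferences(policy_sample, reward_model, prompts, k=8, min_gap=1.0):
    """Illustrative synthetic-preference generation from policy samples.

    `policy_sample(prompt)` and `reward_model(prompt, response)` are hypothetical
    helpers; the selection rule here is a simplification of RLP-SPG.
    """
    pairs = []
    for prompt in prompts:
        responses = [policy_sample(prompt) for _ in range(k)]
        scores = [reward_model(prompt, r) for r in responses]
        best = max(range(k), key=lambda i: scores[i])
        worst = min(range(k), key=lambda i: scores[i])
        if scores[best] - scores[worst] >= min_gap:  # keep only confident comparisons
            pairs.append({"prompt": prompt,
                          "chosen": responses[best],
                          "rejected": responses[worst]})
    return pairs
```

The resulting pairs would then be mixed with the original human preference data to retrain the reward model before continuing policy optimization.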
The authors conduct extensive experiments on three benchmarks (AlpacaFarm, LLMBar, and Vicuna), showing that RLP consistently outperforms state-of-the-art RLHF methods, including PPO-based approaches. The results demonstrate the benefit of accounting for the policy's distribution when refining the reward model.
Statistics
The paper reports the following key metrics:
- Simulated win-rate of different methods on the AlpacaFarm, LLMBar, and Vicuna benchmarks.
- Human win-rate of different methods on the AlpacaFarm benchmark.
Quotes
"Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences."
"(Fixed) reward models may suffer from inaccurate off-distribution, since policy optimization continuously shifts LLMs' data distribution."
"RLP uses policy samples to retrain the reward model via two methods: unsupervised multi-view learning (UML) and synthetic preference generation (SPG)."