Fine-Tuning Large Language Models with Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is an effective approach to aligning large language models (LLMs) with human preferences, but the reward model is trained on a fixed preference dataset and becomes less accurate as the optimized policy's output distribution shifts away from that data. This paper proposes Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using samples drawn from the policy itself, keeping the reward model on-distribution and improving overall RLHF performance.
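The sketch below illustrates the general shape of such an approach, not the paper's actual algorithm: a reward model is periodically refined on samples generated by the current policy so that its scores stay calibrated on the distribution the policy actually produces. The class and function names (`RewardModel`, `sample_from_policy`, `refine_reward_on_policy`), the toy embeddings, and the consistency-style unsupervised loss are all illustrative assumptions standing in for whatever objective RLP specifies.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: scores a (prompt, response) embedding."""
    def __init__(self, dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 32), nn.ReLU())
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

def sample_from_policy(policy, prompts):
    """Stand-in for decoding responses from the current RLHF policy.
    Here the 'policy' just perturbs prompt embeddings to mimic drift."""
    with torch.no_grad():
        return prompts + policy(prompts)

def refine_reward_on_policy(reward_model, policy_samples, steps=50, lr=1e-3):
    """Unsupervised refinement on policy samples (hypothetical objective).

    The placeholder loss encourages consistent scores for two noisy views
    of the same policy sample, keeping the reward model calibrated on the
    policy's own outputs. The real RLP objective is defined in the paper;
    this is only an illustrative stand-in.
    """
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(steps):
        view_a = policy_samples + 0.01 * torch.randn_like(policy_samples)
        view_b = policy_samples + 0.01 * torch.randn_like(policy_samples)
        loss = (reward_model(view_a) - reward_model(view_b)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_model

if __name__ == "__main__":
    dim = 16
    reward_model = RewardModel(dim)
    policy = nn.Linear(dim, dim)      # toy policy stand-in
    prompts = torch.randn(64, dim)    # toy prompt embeddings

    # Interleave reward refinement with policy optimization so the reward
    # model stays accurate on the distribution the policy currently produces.
    for rlhf_round in range(3):
        samples = sample_from_policy(policy, prompts)
        refine_reward_on_policy(reward_model, samples)
        # ... run PPO (or another policy-optimization step) against reward_model ...
        print(f"round {rlhf_round}: mean reward",
              reward_model(samples).mean().item())
```

The key design point this sketch tries to convey is the ordering: reward refinement happens on fresh policy samples before each policy-optimization round, so reward estimates are never evaluated far outside the data they were most recently fit on.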