This paper compares two paradigms for learning from human preferences: reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). RLHF first learns a reward model and then optimizes a policy against it, whereas DPO optimizes the policy parameters directly from preference data. The study analyzes the statistical guarantees, sample complexity, and convergence rates of both approaches. Key findings concern the impact of the reward and policy dimensions, the sample size, and the regularization temperature, as well as the role of mismatch coefficients when the reward is non-realizable.
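As a rough illustration of the difference between the two paradigms, the sketch below writes both training objectives in Python with NumPy. It assumes the standard Bradley-Terry preference model, a linear reward for the RLHF reward-learning stage, and the usual DPO logistic loss. The function names and array arguments (`phi_w`, `phi_l`, `logp_w`, and so on) are illustrative and are not taken from the paper.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def reward_learning_loss(theta, phi_w, phi_l):
    """RLHF, stage 1 (assumed form): Bradley-Terry negative log-likelihood
    for a linear reward r(x, a) = phi(x, a) @ theta.

    phi_w, phi_l: (n, d_reward) features of the preferred / rejected actions.
    """
    margin = (phi_w - phi_l) @ theta
    return -np.mean(log_sigmoid(margin))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO (assumed form): the same logistic loss, but the margin is the
    beta-scaled difference of policy-vs-reference log-ratios, so no
    explicit reward model is fitted.

    All arguments except beta are (n,) arrays of log-probabilities.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(log_sigmoid(margin))
```

In this sketch the RLHF loss is a function of the reward parameters `theta`, while the DPO loss is a function of the policy's log-probabilities, which is one way to see why the reward and policy dimensions enter the two analyses differently. RLHF would still need a second, policy optimization stage after minimizing `reward_learning_loss`.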
The authors provide theoretical results for the exact optimization setting in contextual bandits and deterministic Markov decision processes (MDPs), analyzing the suboptimality gap induced by each paradigm under various conditions. The discussion then extends to the approximate optimization setting, with insights on gradient descent procedures for the reward learning and policy optimization phases.
The results suggest that RLHF outperforms DPO when the reward dimension is smaller than the policy dimension, or when the sample size is small. DPO's performance improves asymptotically as the sample size grows, but it is disproportionately affected by the regularization temperature beta. The study also explores extensions to MDPs with linear rewards and loglinear policies.
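To make the role of the temperature beta concrete, the following sketch computes the well-known closed-form maximizer of a KL-regularized objective for a single context, namely a policy proportional to `ref_probs * exp(rewards / beta)`. It assumes the paper uses this standard KL-regularized formulation; the function name and arguments are hypothetical, and the snippet only illustrates what beta controls, not the paper's bounds.

```python
import numpy as np

def kl_regularized_policy(rewards, ref_probs, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite action set.

    rewards:   (k,) array of rewards for each action in one context.
    ref_probs: (k,) array of strictly positive reference-policy probabilities.
    beta:      regularization temperature (> 0).
    """
    logits = np.log(ref_probs) + rewards / beta
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rewards = np.array([1.0, 0.0])
ref = np.array([0.5, 0.5])
for beta in (0.01, 1.0, 100.0):
    print(beta, kl_regularized_policy(rewards, ref, beta))
```

A small beta pushes the policy toward the highest-reward action, while a large beta keeps it close to the reference policy.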
Future directions include analyzing general function approximation classes for policies, conducting large-scale empirical comparisons, and extending the analysis to broader classes of MDPs.