Core Concepts
Constrained DPO (C-DPO) enhances LLM safety efficiently and effectively.
Abstract
The paper addresses the urgent need to align AI systems with diverse human preferences so as to enhance both their usefulness and their safety. It introduces Constrained DPO (C-DPO), a novel extension of Direct Preference Optimization (DPO) for fine-tuning LLMs. By integrating dual gradient descent with DPO, C-DPO identifies an optimal trade-off between helpfulness and harmlessness without using reinforcement learning. The approach provides LLMs with a safety guarantee that is missing in DPO, while achieving higher rewards than other approaches under the same safety constraint. Note that the paper contains examples of offensive or harmful data.
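To make the recipe concrete, the sketch below illustrates how a dual-gradient-descent outer loop can wrap a DPO-style inner update: for a fixed multiplier λ the policy is fine-tuned against the combined reward r − λc, and λ is then adjusted toward satisfying the cost constraint. This is a minimal, illustrative sketch rather than the authors' implementation; the helper names (`dpo_finetune`, `expected_cost`) and the sign conventions are assumptions.

```python
# Illustrative sketch of the C-DPO outer loop: dual gradient descent on the
# Lagrange multiplier lambda, with a DPO-style fine-tuning step for each
# fixed lambda. The helpers are stand-ins, not the paper's code.

def dpo_finetune(policy, preference_data, lam, beta=0.1):
    """Stand-in for the inner step: re-rank each preference pair by the
    combined reward r(x, y) - lam * c(x, y) and run standard DPO against
    the reference policy. Returns the updated policy."""
    # ... DPO maximum-likelihood update on the re-ranked pairs ...
    return policy

def expected_cost(policy, prompts):
    """Stand-in for a Monte Carlo estimate of E_{y ~ policy}[c(x, y)]."""
    return 0.0

def c_dpo(policy, preference_data, prompts,
          cost_limit=0.0, step_size=0.05, rounds=10):
    """Outer loop: alternate a best-response DPO step with a projected
    dual update on lambda (lambda stays non-negative)."""
    lam = 0.0
    for _ in range(rounds):
        policy = dpo_finetune(policy, preference_data, lam)
        # Constraint violation E[c] - C_limit drives the dual update:
        # lambda grows when the policy is too costly, shrinks otherwise.
        violation = expected_cost(policy, prompts) - cost_limit
        lam = max(0.0, lam + step_size * violation)
    return policy, lam
```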
Introduction
- The proficiency and vulnerabilities of large language models (LLMs).
- Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
Preliminaries
- Overview of RLHF and safe RLHF.
Method
- Introduction of the safe RLHF framework.
- Proposal of Constrained DPO (C-DPO) for aligning LLMs with the dual objectives of helpfulness and harmlessness (a schematic form of the underlying constrained objective is sketched after this list).
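For reference, the constrained objective underlying the safe RLHF framework, and the Lagrangian on which the dual gradient descent operates, can be written schematically as below. This is a standard rendering with notation chosen for illustration (r for the reward model, c for the cost model, C_limit for the cost budget); the paper's exact definitions and sign conventions may differ.

```latex
% Constrained alignment objective (schematic rendering):
\begin{aligned}
\max_{\pi}\quad & \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big] \\
\text{s.t.}\quad & \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ c(x, y) \big] \;\le\; C_{\mathrm{limit}}
\end{aligned}

% Lagrangian relaxation optimized for a fixed multiplier \lambda \ge 0:
J(\pi, \lambda) \;=\; \mathbb{E}\big[ r(x, y) - \lambda\, c(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi \,\big\|\, \pi_{\mathrm{ref}} \big]
  \;+\; \lambda\, C_{\mathrm{limit}}
```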
Experiments
- Evaluation of C-DPO against baselines such as SFT, DPO, and Beaver-v1.
- Comparative analysis of model performance on the test dataset.
Related Work
- Discussion of LLM alignment, RLHF, and safe reinforcement learning.
Appendix
- Analytical results on strong duality, derivation of the optimum of the unconstrained objective, equivalence of the safe RLHF and maximum-likelihood objectives, and the gradient of the dual function (summarized schematically after this list).
- Details of the Constrained DPO (C-DPO) algorithm and the experimental setup.
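The analytical items above can be summarized schematically: for a fixed λ the unconstrained objective has a DPO-style closed-form optimizer, and the dual function D(λ) = max_π J(π, λ) is minimized over λ ≥ 0 by gradient descent. The notation below (partition function Z_λ, step size η) is illustrative; the paper's appendix gives the precise statements.

```latex
% Closed-form optimizer of the fixed-\lambda objective (DPO-style):
\pi^{*}_{\lambda}(y \mid x) \;=\; \frac{1}{Z_{\lambda}(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\big( r(x, y) - \lambda\, c(x, y) \big) \Big)

% Gradient of the dual function and the projected dual update:
\nabla_{\lambda} D(\lambda) \;=\; C_{\mathrm{limit}}
  \;-\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi^{*}_{\lambda}(\cdot \mid x)}\big[ c(x, y) \big],
\qquad
\lambda \;\leftarrow\; \max\!\big( 0,\; \lambda - \eta\, \nabla_{\lambda} D(\lambda) \big)
```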
Statistics
"Our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning."
"C-DPO provides a safety guarantee to LLMs missing in DPO while achieving higher rewards under the same safety constraint."
Quotes
"Our goal in this work is to develop a more scalable fine-tuning framework for improving LLM safety."
"C-DPO with λ = 0.4 emerges as the optimal policy in the present context where the Climit = 0."