Preference Poisoning: Manipulating Language Models through Injected Poisoned Preference Data
An attacker can manipulate the behavior of a language model trained with RLHF by injecting a small amount of poisoned preference data into the training process, causing the model to generate more text containing a target entity in a desired sentiment.