Core Concept
Contrastive Activation Addition (CAA) is a method for precisely steering language models by modifying their activations, and it complements existing alignment techniques.
Statistics
CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples.
CAA significantly alters model behavior and is more effective than traditional methods.
CAA can be used on top of finetuning techniques to improve alignment properties.
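The steering-vector computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `steering_vector` and `apply_steering`, the `multiplier` parameter, and the toy two-dimensional activations are all hypothetical. The summary only states the averaging step; adding the scaled vector back into the residual stream at every token position is an assumption about how the vector is applied.

```python
from typing import List, Sequence

Vector = List[float]


def steering_vector(pos_acts: Sequence[Sequence[float]],
                    neg_acts: Sequence[Sequence[float]]) -> Vector:
    """Average the per-pair difference (positive minus negative) of activations.

    Each row is assumed to be one example's residual-stream activation
    at a fixed layer and token position.
    """
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative examples")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for pos, neg in zip(pos_acts, neg_acts):
        if len(pos) != dim or len(neg) != dim:
            raise ValueError("all activations must have the same dimension")
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    return [t / len(pos_acts) for t in total]


def apply_steering(resid: Sequence[Sequence[float]],
                   vector: Sequence[float],
                   multiplier: float = 1.0) -> List[Vector]:
    """Add the scaled steering vector to every token position (assumed usage)."""
    return [[x + multiplier * v for x, v in zip(row, vector)] for row in resid]


# Toy data: two contrastive pairs with 2-dimensional "activations".
positive = [[1.0, 2.0], [3.0, 4.0]]
negative = [[0.0, 1.0], [1.0, 1.0]]
vec = steering_vector(positive, negative)

# Steer a hypothetical two-token residual stream with multiplier 2.
steered = apply_steering([[0.0, 0.0], [1.0, 1.0]], vec, multiplier=2.0)
```

A negative `multiplier` would push the activations away from the positive behavior instead of toward it, which is how a single vector can be used in both directions under this sketch.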
Quotes
"CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs)."