Efficient Attention Reuse for Cost-Effective Multi-Turn Conversation Inference in Large Language Models
AttentionStore, a new attention mechanism, enables the reuse of key-value (KV) caches across multi-turn conversations: instead of recomputing the KV caches of historical tokens at every turn, it saves and restores them, significantly reducing repetitive computation overhead and improving the inference performance and cost-efficiency of large language models.
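To make the core idea concrete, here is a minimal sketch of conversation-level KV-cache reuse. It uses Hugging Face transformers for illustration; the dict-based `kv_store` and the `answer_turn` helper are illustrative assumptions, not AttentionStore's actual API, which manages caches far more elaborately. The point it demonstrates is that once a conversation's KV cache is persisted between turns, a new turn only needs to prefill its own tokens rather than the entire conversation history.

```python
# A minimal sketch of KV-cache reuse across conversation turns, in the
# spirit of AttentionStore. The plain dict keyed by conversation ID is a
# stand-in for the paper's cache store, not its real implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

kv_store = {}  # conversation_id -> past_key_values kept alive between turns

@torch.no_grad()
def answer_turn(conv_id: str, new_text: str, max_new_tokens: int = 20) -> str:
    """Prefill only this turn's tokens; history attends via the cached KV."""
    input_ids = tokenizer(new_text, return_tensors="pt").input_ids
    past = kv_store.get(conv_id)  # cache hit: skip re-prefilling the history
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        input_ids = next_id  # decode step: feed back one token at a time
    kv_store[conv_id] = past  # persist the KV cache instead of discarding it
    return tokenizer.decode(generated)

# The first turn pays the full prefill; later turns reuse the stored cache,
# so their prefill cost depends only on the length of the new turn.
print(answer_turn("conv-1", "User: Hi, who are you?\nAssistant:"))
print(answer_turn("conv-1", "\nUser: What did I just ask?\nAssistant:"))
```

In a production serving system the cached KV tensors are too large to pin in GPU memory for every idle conversation, which is why a dedicated store that moves caches between memory tiers is needed; the sketch above deliberately ignores eviction and placement to isolate the reuse mechanism itself.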