SpecInfer accelerates large language model (LLM) serving with tree-based speculative inference and verification: small speculative models propose a token tree, and all candidate sequences in the tree are verified against the LLM in parallel. This significantly reduces memory accesses to the LLM's parameters and end-to-end inference latency while preserving the same generative performance as incremental decoding.
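A minimal sketch of the speculate-then-verify loop, using toy stand-in models. The `draft_model`, `target_model`, and tree helpers below are illustrative assumptions, not SpecInfer's actual API, and the verification here walks the tree sequentially for clarity, whereas SpecInfer verifies all branches in a single batched tree-attention pass over the LLM:

```python
import random

# Tiny vocabulary; real systems operate over the LLM's tokenizer vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_dist(seq, temperature):
    # Deterministic pseudo-distribution standing in for a model's softmax output.
    seed = sum((i + 1) * VOCAB.index(t) for i, t in enumerate(seq)) + int(temperature * 10)
    rng = random.Random(seed)
    weights = [rng.random() ** (1.0 / temperature) for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_model(seq):
    return toy_dist(seq, temperature=2.0)   # cheap, lower-quality proposer

def target_model(seq):
    return toy_dist(seq, temperature=1.0)   # expensive target LLM

def build_token_tree(prefix, depth, branching):
    """Expand a speculative token tree with the draft model's top-k tokens."""
    if depth == 0:
        return []
    dist = draft_model(prefix)
    top = sorted(dist, key=dist.get, reverse=True)[:branching]
    return [(tok, build_token_tree(prefix + [tok], depth - 1, branching))
            for tok in top]

def verify_tree(prefix, tree):
    """Greedy verification: keep following a branch only while it matches
    the target model's own top-1 choice at every step."""
    accepted = []
    node = tree
    while node:
        dist = target_model(prefix + accepted)
        target_top = max(dist, key=dist.get)
        match = next(((tok, sub) for tok, sub in node if tok == target_top), None)
        if match is None:
            break  # mismatch: fall back to the target model's own token
        tok, sub = match
        accepted.append(tok)
        node = sub
    return accepted

prefix = ["the", "cat"]
tree = build_token_tree(prefix, depth=3, branching=2)
print("accepted speculative tokens:", verify_tree(prefix, tree))
```

The payoff is that each verification step validates several speculative tokens with one pass over the large model's weights, so parameter memory traffic is amortized across multiple output tokens.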
AttentionStore, a hierarchical key-value (KV) caching system, enables the reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation of historical tokens and improving both the inference performance and cost-efficiency of large language model serving.
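A minimal sketch of the idea, assuming a hypothetical `AttentionStoreSketch` interface keyed by conversation ID (not the paper's actual API): the KV cache produced in earlier turns is saved and reloaded, so a new turn only prefills its newly arrived tokens. The real system additionally spills caches across host memory and disk tiers with asynchronous prefetching and eviction, which this sketch omits:

```python
import time

class AttentionStoreSketch:
    """Toy KV cache store keyed by conversation ID (illustrative only)."""
    def __init__(self):
        self._store = {}  # conversation_id -> (cached_token_count, kv_cache)

    def save(self, conversation_id, token_count, kv_cache):
        self._store[conversation_id] = (token_count, kv_cache)

    def load(self, conversation_id):
        return self._store.get(conversation_id, (0, None))

def prefill(tokens, kv_cache=None):
    """Stand-in for the engine's prefill: pretend each token costs 1 ms
    and the 'KV cache' is just the accumulated token list."""
    time.sleep(0.001 * len(tokens))
    return (kv_cache or []) + list(tokens)

store = AttentionStoreSketch()

history = ["Hello", "there"]           # turn 1: prefill and save the cache
store.save("conv-1", len(history), prefill(history))

new_turn = ["How", "are", "you?"]      # turn 2: reload the cache, so only
cached_len, kv = store.load("conv-1")  # the new tokens need prefilling
kv = prefill(new_turn, kv_cache=kv)
print(f"recomputed {len(new_turn)} tokens instead of {cached_len + len(new_turn)}")
```

Without such a store, every turn would re-run prefill over the full conversation history, so the saved work grows with the length of the conversation.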