Core Concept
SpecInfer accelerates large language model (LLM) serving with tree-based speculative inference and verification. This approach significantly reduces memory accesses to the LLM's parameters and end-to-end inference latency while preserving the same generative performance as incremental decoding.
Summary
The paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models (SSMs) to predict the LLM's outputs, organizing the predictions as a token tree, and verifying the correctness of all candidate token sequences represented by the tree in parallel using a novel tree-based parallel decoding mechanism.
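For illustration, here is a minimal sketch of what a token tree looks like: each node holds one speculated token, and the path from the root to any node is one candidate token sequence. The class and method names below are hypothetical, not SpecInfer's actual API.

```python
# Illustrative token tree: each root-to-node path is a candidate token sequence.
# Names (TokenTreeNode, add_child) are made up for this sketch.

class TokenTreeNode:
    def __init__(self, token_id, parent=None):
        self.token_id = token_id
        self.parent = parent
        self.children = []

    def add_child(self, token_id):
        child = TokenTreeNode(token_id, parent=self)
        self.children.append(child)
        return child

    def sequence(self):
        """Return the candidate token sequence from the root down to this node."""
        node, seq = self, []
        while node is not None:
            seq.append(node.token_id)
            node = node.parent
        return list(reversed(seq))


# Example: two SSMs propose different continuations of the same prefix;
# merging their predictions yields one tree whose leaves are the candidates.
root = TokenTreeNode(token_id=101)   # last committed token
a = root.add_child(7)                # SSM 1 predicts token 7 next
leaf1 = a.add_child(42)              # ... followed by token 42
leaf2 = root.add_child(9)            # SSM 2 predicts token 9 instead
print([leaf1.sequence(), leaf2.sequence()])   # [[101, 7, 42], [101, 9]]
```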
SpecInfer addresses two key challenges:
- Exploring an extremely large search space of candidate token sequences to maximize speculative performance. SpecInfer uses an expansion-based and a merge-based method to construct the token tree, exploiting diversity within a single SSM and across multiple SSMs.
- Verifying the speculated tokens while preserving the LLM's stochastic decoding behavior. SpecInfer introduces a multi-step speculative sampling algorithm that provably preserves the LLM's generative performance.
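As a rough sketch of the verification idea in the second bullet, the snippet below shows a simplified single-candidate acceptance step in the spirit of speculative sampling: accept a speculated token with probability min(1, p_LLM/p_SSM), otherwise resample from the normalized residual distribution. SpecInfer's multi-step speculative sampling generalizes this step to the multiple candidates at each tree node; the probabilities and function name here are illustrative only.

```python
import numpy as np

def verify_token(x, p_llm, p_ssm, rng=None):
    """Simplified speculative-sampling acceptance step (illustrative, not
    SpecInfer's full multi-step algorithm).

    Accept speculated token x with probability min(1, p_llm[x] / p_ssm[x]);
    on rejection, resample from the normalized residual max(p_llm - p_ssm, 0),
    which keeps the overall output distribution equal to the LLM's.
    Returns (emitted_token, was_speculation_accepted).
    """
    rng = np.random.default_rng() if rng is None else rng
    accept_prob = min(1.0, p_llm[x] / max(p_ssm[x], 1e-12))
    if rng.random() < accept_prob:
        return x, True
    residual = np.maximum(p_llm - p_ssm, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_llm), p=residual)), False


# Toy example over a 4-token vocabulary (probabilities are made up).
p_llm = np.array([0.1, 0.6, 0.2, 0.1])
p_ssm = np.array([0.1, 0.4, 0.4, 0.1])
print(verify_token(1, p_llm, p_ssm))   # token 1 is accepted with high probability
```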
SpecInfer's tree-based speculative inference and verification provide two key advantages over the incremental decoding approach of existing LLM inference systems:
- Reduced memory accesses to LLM parameters by leveraging the overlap between speculated token trees and the LLM's actual output.
- Reduced end-to-end inference latency by enabling parallelization across different tokens in a single request.
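The parallelization in the second bullet comes from packing all speculated tokens into a single LLM forward pass and restricting attention so that each token sees only its ancestors in the token tree. The sketch below builds such a tree attention mask; the parent-list encoding is an assumption for illustration, not the paper's exact implementation.

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a tree-structured causal mask.

    parents[i] is the index of token i's parent in the token tree (-1 for the
    root). Token i may attend to itself and its ancestors only, so every
    candidate sequence is verified in one pass without branches leaking into
    each other.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking ancestors visible
            mask[i, j] = True
            j = parents[j]
    return mask

# Tree from the earlier sketch: token 0 is the root, tokens 1 and 3 are its
# children, token 2 is a child of token 1. Token 3 cannot attend to the
# sibling branch (tokens 1 and 2).
print(tree_attention_mask([-1, 0, 1, 0]).astype(int))
```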
The evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative accuracy.
Key Statistics
The largest GPT-3 architecture has 175 billion parameters, requires more than eight NVIDIA 40GB A100 GPUs to store in half-precision floating point, and takes several seconds to serve a single inference request.
SpecInfer can correctly predict the next 4 tokens on average.
Quotes
"SpecInfer accelerates generative large language model (LLM) serving with tree-based speculative inference and verification."
"The key idea behind SpecInfer is leveraging small speculative models (SSMs) to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence."
"SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality."