The paper investigates the feasibility and potential of model-specific spatial acceleration for large language model (LLM) inference on FPGAs. It introduces an analytical framework to estimate the performance of a spatial LLM accelerator, considering on-chip compute and memory resources. The framework can be extended to multi-FPGA settings for distributed inference.
The key highlights and insights are:
Temporal FPGA architectures often struggle to achieve low latency because of considerable memory access overhead, while spatial architectures can reduce off-chip memory accesses and enable pipelined processing.
The generative inference process of LLMs consists of two distinct stages, prefill and decode, whose computational and memory characteristics differ significantly, so each stage requires tailored hardware acceleration.
The analytical framework can identify the most effective parallelization and buffering schemes for the accelerator and determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.
The paper provides a library of high-level synthesis (HLS) kernels that are composable and reusable to enable more productive implementations of LLM models on FPGAs.
Compared to previous FPGA- and GPU-based accelerators, the implemented FPGA-based LLM accelerators achieve up to 13.4x speedup for BERT and, for GPT generative inference, 2.2x speedup in the prefill stage and 1.9x speedup in the decode stage.
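To make the prefill/decode contrast above concrete, here is a minimal roofline-style sketch in Python. It is not the paper's analytical framework: the `Device` class, the function names, and all numbers are hypothetical, and the model assumes weights are read from off-chip memory once per call while ignoring activation traffic and on-chip buffering.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device description (illustrative numbers only)."""
    peak_flops: float     # peak compute throughput in FLOP/s
    mem_bandwidth: float  # off-chip memory bandwidth in bytes/s


def layer_times(tokens: int, d_in: int, d_out: int,
                bytes_per_weight: float, dev: Device) -> tuple[float, float]:
    """Return (compute_time, memory_time) in seconds for one linear layer.

    The layer is a (tokens x d_in) @ (d_in x d_out) matrix multiply.
    Assumption: weights are streamed from off-chip memory once per call,
    and activation traffic is ignored.
    """
    flops = 2.0 * tokens * d_in * d_out            # one multiply + one add per MAC
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / dev.peak_flops, weight_bytes / dev.mem_bandwidth


def layer_latency(tokens: int, d_in: int, d_out: int,
                  bytes_per_weight: float, dev: Device) -> float:
    """Lower bound on latency: the slower of compute and memory."""
    return max(layer_times(tokens, d_in, d_out, bytes_per_weight, dev))


def bound_kind(tokens: int, d_in: int, d_out: int,
               bytes_per_weight: float, dev: Device) -> str:
    """Classify the layer as 'compute'-bound or 'memory'-bound."""
    compute_time, memory_time = layer_times(
        tokens, d_in, d_out, bytes_per_weight, dev)
    return "compute" if compute_time > memory_time else "memory"


if __name__ == "__main__":
    # Made-up device: 1 TFLOP/s peak, 100 GB/s off-chip bandwidth.
    dev = Device(peak_flops=1e12, mem_bandwidth=1e11)
    # Prefill processes the whole prompt at once; decode processes one token.
    for stage, tokens in (("prefill", 512), ("decode", 1)):
        kind = bound_kind(tokens, 1024, 1024, 1.0, dev)
        latency = layer_latency(tokens, 1024, 1024, 1.0, dev)
        print(f"{stage}: {kind}-bound, latency >= {latency:.3e} s")
```

Under these assumptions, a prefill call over many prompt tokens reuses each weight across all tokens and becomes compute-bound, whereas a single-token decode step performs only two FLOPs per weight read and is limited by memory bandwidth. This is one way to see why the two stages benefit from different parallelization and buffering choices.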
Key Insights Extracted From
by Hongzheng Ch... at arxiv.org 04-09-2024
https://arxiv.org/pdf/2312.15159.pdf
Deeper Questions