Energy-Efficient Inference of the 110M Parameter Llama 2 Language Model on FPGAs Using High-Level Synthesis
We develop an energy-efficient FPGA accelerator for inference of the 110M-parameter Llama 2 language model using high-level synthesis (HLS). Per token, the accelerator reduces energy consumption by up to 12.75x relative to a CPU baseline and 8.25x relative to a GPU baseline, while sustaining 0.53x the inference speed of a high-end GPU.
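The abstract does not show the accelerator's kernels, but a minimal sketch of the kind of HLS-style matrix-vector kernel such a design typically builds on might look like the following. All names, sizes, and pragma choices here are illustrative assumptions, not the paper's actual implementation; `#pragma HLS` directives are ignored by a standard C++ compiler, so the function can also be compiled and functionally tested on a host.

```cpp
#include <cstddef>

// Illustrative HLS-style matrix-vector multiply (y = W * x), the core
// operation in transformer inference. Fixed compile-time sizes let an HLS
// tool fully partition the arrays and pipeline the row loop; on a host
// compiler the pragmas are ignored and this behaves as plain C++.
constexpr std::size_t ROWS = 8;
constexpr std::size_t COLS = 8;

void matvec(const float W[ROWS][COLS], const float x[COLS], float y[ROWS]) {
#pragma HLS ARRAY_PARTITION variable = W complete dim = 2
#pragma HLS ARRAY_PARTITION variable = x complete
row_loop:
    for (std::size_t r = 0; r < ROWS; ++r) {
#pragma HLS PIPELINE II = 1
        float acc = 0.0f;
    col_loop:
        for (std::size_t c = 0; c < COLS; ++c) {
#pragma HLS UNROLL
            // With the column loop unrolled and x fully partitioned, the
            // HLS tool can schedule these multiply-accumulates in parallel.
            acc += W[r][c] * x[c];
        }
        y[r] = acc;
    }
}
```

A real accelerator would additionally quantize weights and stream them from off-chip memory, but this sketch shows the loop structure and directives that HLS designs of this kind commonly revolve around.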