Energy-Efficient Inference of the 110M Parameter Llama 2 Language Model on FPGAs Using High-Level Synthesis
We develop an energy-efficient FPGA accelerator for inference of the 110M-parameter Llama 2 language model using high-level synthesis (HLS). Per token, the accelerator reduces energy consumption by up to 12.75x relative to a CPU baseline and 8.25x relative to a GPU baseline, while sustaining 0.53x the inference speed of a high-end GPU.
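The abstract does not show the accelerator's kernels, but a minimal sketch of the kind of HLS-style matrix-vector kernel such a design typically builds on might look like the following. All names, sizes, and pragma choices here are illustrative assumptions, not the paper's actual implementation; `#pragma HLS` directives are ignored by a standard C++ compiler, so the function can also be compiled and functionally tested on a host.

```cpp
#include <cstddef>

// Illustrative HLS-style matrix-vector multiply (y = W * x), the core
// operation in transformer inference. Fixed compile-time sizes let an HLS
// tool fully partition the arrays and pipeline the row loop; on a host
// compiler the pragmas are ignored and this behaves as plain C++.
constexpr std::size_t ROWS = 8;
constexpr std::size_t COLS = 8;

void matvec(const float W[ROWS][COLS], const float x[COLS], float y[ROWS]) {
#pragma HLS ARRAY_PARTITION variable = W complete dim = 2
#pragma HLS ARRAY_PARTITION variable = x complete
row_loop:
    for (std::size_t r = 0; r < ROWS; ++r) {
#pragma HLS PIPELINE II = 1
        float acc = 0.0f;
    col_loop:
        for (std::size_t c = 0; c < COLS; ++c) {
#pragma HLS UNROLL
            // With the column loop unrolled and x fully partitioned, the
            // HLS tool can schedule these multiply-accumulates in parallel.
            acc += W[r][c] * x[c];
        }
        y[r] = acc;
    }
}
```

A real accelerator would additionally quantize weights and stream them from off-chip memory, but this sketch shows the loop structure and directives that HLS designs of this kind commonly revolve around.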