Key Concepts
This paper proposes an FPGA-based accelerator design to efficiently support the Convolution-Transformer hybrid architecture of the state-of-the-art efficient Vision Transformer, EfficientViT, by leveraging a reconfigurable architecture and a novel time-multiplexed and pipelined dataflow.
Summary
The paper presents an FPGA-based accelerator design for the efficient Vision Transformer (ViT) model called EfficientViT. EfficientViT features a Convolution-Transformer hybrid architecture, comprising lightweight convolutions (MBConvs) and a lightweight Multi-Scale Attention (MSA) module.
The key contributions are:
- Reconfigurable Architecture Design:
  - A reconfigurable processing element (RPE) architecture is designed to efficiently support the various operation types in EfficientViT, including lightweight convolutions and lightweight attention.
  - The RPE can operate in either DW mode for depthwise convolutions or PW mode for pointwise and generic convolutions (a behavioral sketch of the two modes follows this list).
- Time-Multiplexed and Pipelined Dataflow:
  - A novel time-multiplexed and pipelined (TMP) dataflow is proposed to fuse computations among adjacent lightweight convolutions and computations within the lightweight attention module (a toy sketch of the fusion idea follows this list).
  - This dramatically boosts computing resource utilization while easing bandwidth requirements.
- Accelerator Design and Evaluation:
  - The proposed accelerator incorporates both the RPE engine and a MAT (multipliers and adder-trees) engine to efficiently execute the various operations in EfficientViT (a hypothetical op-to-engine mapping is sketched after this list).
  - Implemented on the Xilinx ZCU102 FPGA, the accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency, significantly outperforming prior works.
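To make the RPE's two operating modes concrete, here is a minimal NumPy behavioral sketch. It is not the paper's RTL or exact datapath; the function names, tensor shapes, and mode-selection interface are illustrative assumptions.

```python
import numpy as np

def rpe_depthwise(x, w):
    """DW mode: each input channel is convolved with its own k x k kernel.
    x: (C, H, W) feature map, w: (C, k, k) per-channel kernels."""
    C, H, W = x.shape
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    y = np.zeros_like(x)
    for c in range(C):                      # one lane of the PE array per channel
        for i in range(H):
            for j in range(W):
                y[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * w[c])
    return y

def rpe_pointwise(x, w):
    """PW mode: 1x1 convolution, i.e. a matrix multiply over the channel dim.
    x: (C_in, H, W) feature map, w: (C_out, C_in) weights."""
    C_in, H, W = x.shape
    return (w @ x.reshape(C_in, -1)).reshape(w.shape[0], H, W)

def rpe(x, w, mode):
    """Behavioral model of the reconfigurable PE: same multiply-accumulate
    resources, different accumulation pattern depending on the selected mode."""
    return rpe_depthwise(x, w) if mode == "DW" else rpe_pointwise(x, w)

# Example: 8-channel 16x16 map, 3x3 depthwise then 1x1 pointwise to 16 channels.
x = np.random.rand(8, 16, 16)
y_dw = rpe(x, np.random.rand(8, 3, 3), mode="DW")
y_pw = rpe(y_dw, np.random.rand(16, 8), mode="PW")
print(y_dw.shape, y_pw.shape)  # (8, 16, 16) (16, 16, 16)
```

In this sketch, the two modes differ only in how products are accumulated (within a channel's spatial window for DW, across channels for PW), which is the kind of sharing a reconfigurable PE can exploit.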
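The fusion idea behind the TMP dataflow can also be illustrated with a toy sketch. The version below collapses the depthwise stage to per-channel scaling and models pipelining as tile-by-tile producer/consumer ordering rather than overlapped hardware stages, so it is an assumption-laden illustration, not the paper's actual schedule.

```python
import numpy as np

def depthwise_tile(tile, w_dw):
    """Stand-in for the DW stage: per-channel scaling of a row tile
    (a real design would apply a k x k depthwise kernel with halo rows)."""
    return tile * w_dw[:, None, None]

def pointwise_tile(tile, w_pw):
    """Stand-in for the PW stage: 1x1 convolution as a matmul over channels."""
    c, h, w = tile.shape
    return (w_pw @ tile.reshape(c, -1)).reshape(w_pw.shape[0], h, w)

def tmp_fused(x, w_dw, w_pw, tile_rows=4):
    """Time-multiplex the two stages over row tiles: each intermediate tile is
    consumed by the next stage right after it is produced, so it never leaves
    on-chip buffers between the adjacent lightweight convolutions."""
    c_in, h, w = x.shape
    out = np.zeros((w_pw.shape[0], h, w))
    for r in range(0, h, tile_rows):
        dw_tile = depthwise_tile(x[:, r:r + tile_rows, :], w_dw)       # stage 1
        out[:, r:r + tile_rows, :] = pointwise_tile(dw_tile, w_pw)     # stage 2, fused
    return out

# Example: 16-channel 32x32 map expanded to 32 channels, processed 4 rows at a time.
x = np.random.rand(16, 32, 32)
y = tmp_fused(x, np.random.rand(16), np.random.rand(32, 16))
print(y.shape)  # (32, 32, 32)
```

The point the sketch tries to capture is that intermediate tiles are handed directly from one lightweight convolution to the next, avoiding the round trip through external memory that drives bandwidth requirements.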
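Finally, a hypothetical dispatch sketch for the two compute engines. The op-to-engine mapping below (convolutions on the RPE engine, attention matrix multiplies on the MAT engine) is an assumption made for illustration; this summary only states that both engines exist.

```python
# Hypothetical op-to-engine mapping, for illustration only.
LAYERS = [
    {"stage": "MBConv", "op": "dwconv3x3"},
    {"stage": "MBConv", "op": "pwconv1x1"},
    {"stage": "MSA",    "op": "matmul_qk"},
    {"stage": "MSA",    "op": "matmul_av"},
]

def pick_engine(layer):
    """Route convolution ops to the reconfigurable PEs and matrix multiplies
    to the multipliers-and-adder-trees engine (assumed partitioning)."""
    return "RPE" if "conv" in layer["op"] else "MAT"

for layer in LAYERS:
    print(f"{layer['stage']}/{layer['op']} -> {pick_engine(layer)} engine")
```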
Statistics
The proposed accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency when implemented on the Xilinx ZCU102 FPGA at 200 MHz.
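For context, if the peak throughput and peak energy-efficiency figures refer to the same operating point, the implied power draw is roughly 780.2 GOPS / 105.1 GOPS/W ≈ 7.4 W.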
Quotes
"To fully unleash its hardware benefit potential, it is highly desired to develop a dedicated accelerator for EffieicientViT, which, however, poses challenges due to its dynamic workloads and high-intensity memory access demands."
"Particularly, EfficientViT involves various operation types, including lightweight convolutions (i.e., MBConvs) with different kernel sizes, strides, and feature dimensions, as well as the lightweight attention (i.e., MSA), which exhibits distinct computational patterns compared to the vanilla self-attention in standard ViTs."