Key Concepts
A novel weight packing algorithm that minimizes weight loading overheads and maximizes computational parallelism in in-memory computing accelerators.
Summary
The paper presents a weight packing algorithm for in-memory computing (IMC) accelerators to address two key challenges:
- Weight loading overhead: Fetching weights from external memory incurs significant energy and latency penalties. The proposed method packs weights densely in the IMC array to minimize reloading.
- Underutilization of computational parallelism: The weight mapping scheme determines how well the IMC array's inherent parallelism can be exploited. The algorithm aims to maximize utilization of the available compute resources.
The weight packing algorithm works in three steps:
- Tile generation: Defines a pool of weight tiles that fit in the IMC array dimensions (Di, Do, Dh, Dm).
- Supertile generation: Combines the tiles into denser supertiles, with constraints to maintain spatial parallelism across layers.
- Column generation and allocation: Packs the supertiles into dense columns and allocates them across the IMC macros.
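The three steps above could be sketched roughly as follows. The array dimensions and the least-loaded allocation heuristic are illustrative assumptions, not the paper's exact procedure; names mirror the paper's Di, Do, Dh, Dm parameters.

```python
from dataclasses import dataclass

# Hypothetical IMC array dimensions (illustrative values, not from the paper).
D_I, D_O = 64, 64   # rows (inputs) and columns (outputs) of one IMC macro
D_H = 4             # storage cells per multiplication unit (vertical depth)
D_M = 2             # number of IMC macros

@dataclass
class Tile:
    layer: int
    rows: int   # <= D_I
    cols: int   # <= D_O

def generate_tiles(layer_shapes):
    """Step 1: split each layer's (K, C) weight matrix into tiles
    that fit the D_I x D_O array."""
    tiles = []
    for layer, (k, c) in enumerate(layer_shapes):
        for r0 in range(0, c, D_I):
            for c0 in range(0, k, D_O):
                tiles.append(Tile(layer,
                                  min(D_I, c - r0),
                                  min(D_O, k - c0)))
    return tiles

def generate_supertiles(tiles):
    """Step 2: stack up to D_H tiles of the same layer into a supertile,
    keeping tiles of one layer together to preserve spatial parallelism."""
    by_layer = {}
    for t in tiles:
        by_layer.setdefault(t.layer, []).append(t)
    supertiles = []
    for layer_tiles in by_layer.values():
        for i in range(0, len(layer_tiles), D_H):
            supertiles.append(layer_tiles[i:i + D_H])
    return supertiles

def allocate_columns(supertiles):
    """Step 3: pack supertiles into columns across the D_M macros;
    here, a simple least-loaded (first-fit-style) heuristic."""
    macros = [[] for _ in range(D_M)]
    loads = [0] * D_M  # number of supertiles per macro
    for st in supertiles:
        target = loads.index(min(loads))
        macros[target].append(st)
        loads[target] += 1
    return macros
```

For example, `generate_tiles([(128, 96), (64, 64)])` yields four tiles for the first layer (two row splits times two column splits) plus one for the second, which then collapse into two supertiles spread over the two macros.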
The proposed method is evaluated on the MLPerf Tiny benchmark and compared to baseline weight mapping techniques. It demonstrates up to 100x improvement in energy-delay product (EDP) by effectively mitigating weight loading overheads while maximizing computational utilization.
The paper also analyzes the area-EDP trade-offs by sweeping the Dh and Dm design parameters, showing the benefits of the weight packing approach in area-constrained scenarios.
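As a rough illustration of why dense packing drives EDP down: EDP multiplies total energy by total latency, and the number of weight reloads inflates both factors at once. All constants in the sketch below are hypothetical, not values from the paper.

```python
# Back-of-the-envelope EDP model (all numbers are illustrative assumptions).
PJ_PER_BIT = 20e-12            # ~tens of pJ per bit for off-chip weight fetches
WEIGHT_BITS = 8 * 1024 * 1024  # assumed total weight footprint, in bits

def edp(reloads, compute_energy_j, compute_latency_s, stall_per_reload_s):
    """Energy-delay product = total energy x total latency.
    Reloads add fetch energy AND stall cycles, so they hit both factors."""
    load_energy = reloads * WEIGHT_BITS * PJ_PER_BIT
    total_latency = compute_latency_s + reloads * stall_per_reload_s
    return (compute_energy_j + load_energy) * total_latency

# A sparse baseline mapping that must reload weights many times, vs. a dense
# packing that fits everything on-chip and loads each weight only once.
baseline = edp(reloads=100, compute_energy_j=1e-6,
               compute_latency_s=1e-3, stall_per_reload_s=1e-3)
packed = edp(reloads=1, compute_energy_j=1e-6,
             compute_latency_s=1e-3, stall_per_reload_s=1e-3)
assert packed < baseline  # fewer reloads improve both factors of EDP
```

Because the reload count enters both the energy and the latency term, the EDP gain compounds multiplicatively, which is how order-of-magnitude improvements become possible.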
Statistics
In-memory computing (IMC) designs can achieve 10x improvements in peak efficiency and performance for matrix-vector multiplications compared to conventional digital designs.
Weight loading from external memory incurs significant energy (tens of pJ/bit) and latency overheads.
The proposed weight packing method can achieve up to 100x improvement in energy-delay product (EDP) compared to baseline weight mapping techniques.
Quotes
"Weight loading affects both energy consumption and latency. Energy-wise, each loading requires fetching data from outside the IMC macro, reshuffle it so as to present it in the right alignment and load it in the memory array, with a large word parallelism. Latency-wise, weight loading and computation can not occur in parallel within one memory macro and this causes intrinsic stalls whenever the weight values have to be updated."
"To face these issues it is required to act both on the hardware architecture and on the dataflow. From a hardware perspective, new IMC architectures include 1) multiple cells per multiplication unit to increase on-chip memory density and 2) multiple macros to increase dataflow flexibility and hence compute utilization. Nevertheless, from a dataflow standpoint, a suitable mapping scheme for operands in novel IMC designs is still missing such that the available dense memory is optimally utilized – minimizing thus data movement from and towards the IMC macros – while at the same time not sacrificing throughput and energy efficiency of the computation."