The Bicameral Cache is a novel cache design that segregates scalar and vector data references to optimize performance on vector architectures by preserving the spatial locality of vector data and avoiding interference between scalar and vector accesses.
Enabling instruction-level preemption in heterogeneous co-processors can significantly reduce the duration of algorithmic priority and criticality inversions in mixed-criticality systems.
A novel weight packing algorithm minimizes weight loading overheads and maximizes computational parallelism in in-memory computing accelerators.
PARALLAX, a compiler for neutral atom quantum computers, reduces high-error operations by 25% and increases the success rate by 28% on average compared to state-of-the-art techniques by leveraging the unique properties of neutral atom systems, such as multi-qubit gates, application-specific topologies, movable qubits, homogeneous qubits, and long-range interactions.
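To gain an accurate understanding of DRAM microarchitecture and error characteristics, we use a variety of reverse-engineering techniques to analyze the internal structure and operation of DRAM chips in depth.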
We develop an energy-efficient FPGA accelerator for the 110M parameter Llama 2 language model using high-level synthesis (HLS) techniques, achieving up to a 12.75x reduction in energy consumption per token compared to a CPU and an 8.25x reduction compared to a GPU, while running at 0.53x the inference speed of a high-end GPU.
The proposed Memory-Mapped Near-Data Processing (M2NDP) architecture enables low-overhead, general-purpose near-data processing in CXL memory expanders by introducing Memory-Mapped functions (M2func) for efficient offloading and Memory-Mapped µthreading (M2µthr) for cost-effective NDP kernel execution.
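To address the performance degradation caused by the shared L3 TLB in multi-instance GPU environments, we propose a TLB design based on sub-entry sharing.
We optimize the parallel general matrix multiplication (GEMM) algorithm of GotoBLAS2 using the multiple AI Engines (AIEs) of the AMD Versal ACAP, and propose an architecture-specific micro-kernel that supports mixed-precision arithmetic for deep learning inference.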
TDRAM is a novel DRAM microarchitecture that enhances HBM3 with on-die tag storage and fast tag comparison to enable efficient DRAM caching, reducing hit and miss latencies, bandwidth bloat, and energy consumption.