2026 · RTL designer

GeMM-64 — A Highly Efficient Matrix-Multiply Accelerator

A 64-PE matrix-multiply accelerator written in SystemVerilog. Three tile-mapping modes, dual-FSM pipelining, and a roofline analysis showing the design runs close to its compute limit.


Overview

GeMM-64 is a purely digital design project: a fast hardware accelerator for dense matrix multiplication, built from scratch in SystemVerilog. The aim was not mere correctness; I strove for near-optimal hardware utilization across multiple problem sizes.

Microarchitecture

  • 64 Processing Elements (PEs) arranged in an 8x8 array fed by a dynamic broadcasting network. Each PE is a multiply-accumulate (MAC) unit with pipelined shadow registers, so a tile’s results can drain while the next tile’s inputs are already being consumed. This keeps result write-back over the SRAM bus from stalling the array.
  • Dual-FSM control. Two FSMs run concurrently, so the write-back stage of one tile overlaps with the computation of the next. That cuts cycle counts by ~35–45% versus a serialized baseline.
  • Three tile-mapping modes. The problem definition specified three matrix-matrix workloads of different sizes. In each mode, matrix elements are packed so that they arrive at the MACs without added latency.
  • Memory layout. Multi-step tiling matched to a 128-bit-wide SRAM read; arithmetic intensity (AI) is fine-tuned to keep the array fed without burning bandwidth on partial loads.
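
The shadow-register idea in the first bullet can be illustrated with a small behavioral model in Python. This is a sketch under assumptions, not the actual SystemVerilog: the class and method names are hypothetical, and the model ignores cycle timing, bit widths, and the broadcasting network.

```python
class ShadowedPE:
    """Behavioral (not cycle-accurate) model of one MAC PE with a shadow register."""

    def __init__(self):
        self.acc = 0     # live accumulator, updated by every MAC
        self.shadow = 0  # holds the finished tile's result while it drains

    def mac(self, a, b):
        """One multiply-accumulate step for the tile currently in flight."""
        self.acc += a * b

    def swap(self):
        """Tile boundary: latch the result into the shadow register, clear the accumulator."""
        self.shadow = self.acc
        self.acc = 0

    def drain(self):
        """Read the previous tile's result; the accumulator is free for the next tile."""
        return self.shadow


pe = ShadowedPE()
for a, b in [(1, 2), (3, 4)]:   # first tile: 1*2 + 3*4
    pe.mac(a, b)
pe.swap()
pe.mac(5, 6)                    # next tile starts before the drain happens
first_result = pe.drain()       # still the first tile's sum
```

Because the drain reads only the shadow register, accumulating the next tile never disturbs the result that is being written back.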

CPAEP architecture diagram
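
To see why overlapping write-back with compute pays off, here is a rough cycle-count model in Python. It is a sketch, not the RTL and not the report's measurements: the function names are hypothetical, and the per-tile cycle counts used in the example (8 for compute, 5 for write-back) are made-up values chosen only for illustration.

```python
def serialized_cycles(tiles, compute, writeback):
    """One FSM: each tile computes, then writes back, before the next starts."""
    return tiles * (compute + writeback)


def overlapped_cycles(tiles, compute, writeback):
    """Two FSMs: a tile's write-back runs alongside the next tile's compute."""
    if tiles == 0:
        return 0
    # The first compute fills the pipeline, the last write-back drains it,
    # and every step in between is bounded by the slower of the two stages.
    return compute + (tiles - 1) * max(compute, writeback) + writeback


def saving(tiles, compute, writeback):
    """Fraction of cycles removed by overlapping, relative to the serialized case."""
    base = serialized_cycles(tiles, compute, writeback)
    return 1 - overlapped_cycles(tiles, compute, writeback) / base


# Hypothetical workload: 100 tiles, 8 compute cycles and 5 write-back cycles each.
example_saving = saving(100, 8, 5)
```

With these illustrative numbers the saving is about 38%. In this model, when write-back is shorter than compute, the saving approaches writeback / (compute + writeback) as the tile count grows, so the real figure depends on the actual per-tile cycle counts.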

Highlights

  • Roofline-driven. Performance characterized as ops/cycle vs. arithmetic intensity; the design lands close to the compute roof for the tested workloads, not the bandwidth roof.
  • Pipelined shadow registers. Adding a pipelined output register is well worth the area cost, as it allows near-perfect MAC utilization across workloads.
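
The roofline bound mentioned above can be sketched in a few lines of Python. The two constants are assumptions made for this sketch, not figures from the report: a peak of 64 MACs per cycle (one per PE) and one 128-bit SRAM read per cycle. Arithmetic intensity is expressed here in MACs per byte, and the function names are hypothetical.

```python
PEAK_MACS_PER_CYCLE = 64     # assumption: each of the 64 PEs completes one MAC per cycle
BYTES_PER_CYCLE = 128 // 8   # assumption: one 128-bit SRAM read per cycle


def attainable(ai):
    """Roofline bound in MACs/cycle for arithmetic intensity `ai` (MACs/byte)."""
    return min(PEAK_MACS_PER_CYCLE, ai * BYTES_PER_CYCLE)


def limiting_roof(ai):
    """Name the roof that limits a workload with intensity `ai`."""
    ridge = PEAK_MACS_PER_CYCLE / BYTES_PER_CYCLE  # intensity where the two roofs meet
    return "compute" if ai >= ridge else "bandwidth"
```

Under these assumptions the two roofs meet at 4 MACs per byte: a workload below that intensity is limited by SRAM bandwidth, and one at or above it can in principle keep all 64 PEs busy.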

Status

Final report submitted in May 2026; it received a grade of 19/20.

Download report (PDF) →