SCALE-Sim PE Array Modeling: Cycles, Utilization, and Loss Sources
Sub-researcher note: Covers cycle-approximate modeling methodology only. Out of scope: memory hierarchy, NoC, GPU/SIMT, chip design details.
1. SCALE-Sim Overview
SCALE-Sim (Systolic CNN Accelerator Simulator) is a cycle-approximate simulator for systolic array–based DNN accelerators. It accepts an accelerator config (array size, dataflow, SRAM sizes) and a DNN layer description, then outputs:
- Compute cycle counts per layer
- On-chip memory access traces
- Bandwidth demands
Three generations exist: v1 (2018), v2 (~2021, adds double-buffered SRAM), v3 (2025, adds multi-core, spatio-temporal partitioning, sparse tensor support).
2. Dataflow Taxonomy (Mapping)
SCALE-Sim models three stationary dataflows. Each maps the three GEMM dimensions (M = output channels, N = input channels, P = spatial) differently onto the 2-D PE array:
| Dataflow | Spatial Rows | Spatial Columns | Temporal (streamed) |
|---|---|---|---|
| Weight-Stationary (WS) | M | N | P |
| Input-Stationary (IS) | N | P | M |
| Output-Stationary (OS) | M | P | N |
"Stationary" names the operand that stays in PE-local registers and is reused across cycles; the other two operands stream through the array. The dataflow choice determines:
- Which dimension iterates spatially (mapped once to PEs)
- Which iterates temporally (serialized across cycles)
- Which tiles stream as a wavefront through the systolic pipeline
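The mapping table can be written as a small lookup. This is an illustrative sketch only: `DATAFLOW_MAP` and `map_dims` are hypothetical names, not part of SCALE-Sim's API, and the mapping simply transcribes the table above.

```python
# Dataflow -> (spatial rows, spatial columns, temporal), transcribed from the
# table above. Names here are illustrative, not SCALE-Sim identifiers.
DATAFLOW_MAP = {
    "ws": ("M", "N", "P"),  # Weight-Stationary
    "is": ("N", "P", "M"),  # Input-Stationary
    "os": ("M", "P", "N"),  # Output-Stationary
}


def map_dims(dataflow: str, M: int, N: int, P: int) -> tuple:
    """Return (Sr, Sc, T) for a layer with dimensions M, N, P."""
    dims = {"M": M, "N": N, "P": P}
    rows, cols, temporal = DATAFLOW_MAP[dataflow.lower()]
    return dims[rows], dims[cols], dims[temporal]


# Example layer: 64 output channels, 32 input channels, 196 spatial positions.
for name in ("ws", "is", "os"):
    print(name, map_dims(name, M=64, N=32, P=196))
```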
3. Cycle Computation Formula
3.1 Base case (single tile, single core)
For a systolic array of shape R×C executing a single tile with spatial dimensions Sr × Sc and temporal depth T, the basic cycle formula is:
Cycles_tile = (2*R + C + T - 2)
The breakdown:
- (R + C - 2): pipeline fill (startup). The diagonal wavefront must traverse R rows and then C columns before the first output is produced.
- T: the temporal accumulation depth (e.g., inner-product accumulation steps).
- R: additional drain cycles after the last weight row exits.
The three terms sum to 2*R + C + T - 2. This is the standard systolic pipeline latency (R + C - 2 fill cycles plus T compute cycles), extended by R drain cycles.
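As a quick arithmetic check, the tile formula can be wrapped in a helper. This is a minimal sketch; `tile_cycles` is a hypothetical name rather than a SCALE-Sim function, and the example sizes are arbitrary.

```python
def tile_cycles(R: int, C: int, T: int) -> int:
    """Cycles for one tile on an R x C array with temporal depth T."""
    return 2 * R + C + T - 2


# Example: a 32x32 array with T = 64 accumulation steps.
print(tile_cycles(32, 32, 64))  # 158
```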
3.2 Multi-tile execution (realistic layers)
A layer's dimensions (M, N, P) typically exceed the array size (R, C). Tiles are executed sequentially (or partitioned). SCALE-Sim v3 gives three variants depending on how partitioning is applied:
Spatial partitioning (Eq. 1):
Cycles = (2*R + C + T - 2) * ceil(Sr / (Pr*R)) * ceil(Sc / (Pc*C))
Spatio-temporal variant 1 (Eq. 2):
Cycles = (2*R + C + ceil(T/Pc) - 2) * ceil(Sr / (Pr*R)) * ceil(Sc / C)
Spatio-temporal variant 2 (Eq. 3):
Cycles = (2*R + C + ceil(T/Pr) - 2) * ceil(Sr / R) * ceil(Sc / (Pc*C))
Where:
- R, C = physical array dimensions (rows, columns)
- Sr, Sc, T = spatial-row, spatial-column, and temporal dimensions of the full layer (before tiling), as assigned by the dataflow in §2
- Pr, Pc = number of partitions along rows and columns respectively
- ceil() = ceiling division, forced by the integer number of array passes needed
For the simple single-core case (Pr = Pc = 1), Eq. 1 reduces to:
Cycles = (2*R + C + T - 2) * ceil(M/R) * ceil(N/C)
(for Weight-Stationary; dimension mapping follows the table in §2.)
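The three variants can be compared side by side with a short sketch. It rests on two assumptions that are readings of the equations rather than something checked against the SCALE-Sim v3 source: the partition count multiplies the array dimension inside the ceiling (so that Pr = Pc = 1 recovers the single-core form), and a spatial dimension that is not partitioned is folded over the physical array size. `layer_cycles` and the mode names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids float rounding)."""
    return -(-a // b)


def layer_cycles(R, C, Sr, Sc, T, Pr=1, Pc=1, mode="spatial"):
    """Cycle estimate for one layer under the three partitioning variants.

    mode="spatial": Eq. 1, Sr split over Pr and Sc split over Pc.
    mode="st_cols": Eq. 2, Sr split over Pr and T split over Pc.
    mode="st_rows": Eq. 3, T split over Pr and Sc split over Pc.
    """
    if mode == "spatial":
        depth, row_folds, col_folds = T, ceil_div(Sr, Pr * R), ceil_div(Sc, Pc * C)
    elif mode == "st_cols":
        depth, row_folds, col_folds = ceil_div(T, Pc), ceil_div(Sr, Pr * R), ceil_div(Sc, C)
    elif mode == "st_rows":
        depth, row_folds, col_folds = ceil_div(T, Pr), ceil_div(Sr, R), ceil_div(Sc, Pc * C)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return (2 * R + C + depth - 2) * row_folds * col_folds


# Single core, 128x128 array, Sr=130, Sc=128, T=256: two row folds.
print(layer_cycles(128, 128, 130, 128, 256))                        # 1276
# Two row partitions absorb the extra fold.
print(layer_cycles(128, 128, 130, 128, 256, Pr=2))                  # 638
# Splitting T over two column partitions shortens each pass instead.
print(layer_cycles(128, 128, 130, 128, 256, Pc=2, mode="st_cols"))  # 1020
```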
4. Array Utilization: Definition and Computation
Definition: Utilization = fraction of PE-cycles performing useful MACs.
Utilization = actual_MACs / (R * C * total_cycles)
Where actual_MACs = product of the true layer dimensions (no padding/rounding),
and R * C * total_cycles = peak MAC capacity of the array over those cycles.
SCALE-Sim v3 also expresses this through MAC action counts:
MAC_random = #PEs × cycles × utilization (active MACs)
MAC_constant = #PEs × cycles × (1 - utilization) (idle/wasted PE-cycles)
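The definition and the two MAC action counts translate directly into code. This is a sketch under the definitions above; `utilization` and `mac_action_counts` are hypothetical helper names, and the example layer is chosen to fit exactly one tile.

```python
def utilization(actual_macs: int, R: int, C: int, total_cycles: int) -> float:
    """Fraction of PE-cycles that perform useful MACs."""
    return actual_macs / (R * C * total_cycles)


def mac_action_counts(R: int, C: int, total_cycles: int, util: float) -> dict:
    """Split total PE-cycles into active and idle counts."""
    pe_cycles = R * C * total_cycles
    return {
        "MAC_random": pe_cycles * util,          # active MACs
        "MAC_constant": pe_cycles * (1 - util),  # idle PE-cycles
    }


# One full tile on a 128x128 array: M = N = 128, P = T = 256.
R = C = 128
cycles = 2 * R + C + 256 - 2          # 638 cycles from the tile formula
macs = 128 * 128 * 256                # true layer MAC count
u = utilization(macs, R, C, cycles)   # 256 / 638, about 0.40
print(round(u, 3), mac_action_counts(R, C, cycles, u))
```

Even this perfectly divisible tile stays near 40% utilization, because fill and drain cycles count toward `total_cycles`.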
5. Sources of Utilization Loss
5.1 Tile Quantization (spatial boundary effect)
When layer dimensions M, N are not exact multiples of array size R, C, the last tile along each dimension is partial: only a fraction of PEs do real work, but the full pipeline startup/drain cost is still paid.
Example: M=130 on a 128-row array. The last spatial pass has only 2 active rows out of 128, but takes the same (2*R + C + T - 2) cycles as a full tile. This is the ceil-rounding penalty:
wasted_fraction ≈ 1 - M/(ceil(M/R) * R)
This loss is most severe when a dimension is slightly above a multiple of the array size.
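The penalty is easy to tabulate around a multiple of the array size. A minimal sketch of the `wasted_fraction` expression above (the helper name mirrors the formula and is not a SCALE-Sim identifier):

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def wasted_fraction(dim: int, array_dim: int) -> float:
    """Share of padded rows (or columns) that carry no real work."""
    padded = ceil_div(dim, array_dim) * array_dim
    return 1 - dim / padded


# Sweep M across the 128-row boundary.
for M in (128, 129, 130, 192, 256):
    print(M, round(wasted_fraction(M, 128), 3))
```

For M = 130 on 128 rows the result is 1 - 130/256, roughly 0.49: nearly half of the row slots across the two passes are padding.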
5.2 Wave Quantization (temporal/multi-SM effect)
When the total number of tiles (ceil(M/R) × ceil(N/C)) is not a multiple of available parallel execution units, a partial last wave occurs. The tail wave uses only a fraction of the available parallelism but takes roughly the same wall-clock time as a full wave.
From NVIDIA's analysis (GPU context, but directly analogous): if 109 tiles must run on 108 SMs, the last wave uses only 1/108 = 0.93% of capacity yet takes as long as the first full wave of 108 tiles. GFLOPS can roughly halve at these quantization boundaries.
For a systolic array, the analogous effect occurs when a layer requires a fractional number of complete array "sweeps." The final partial sweep incurs full pipeline startup/drain overhead but delivers proportionally fewer useful MACs.
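The 109-tiles-on-108-units case can be reproduced with a few lines. This sketch assumes every wave takes the same time regardless of how many units are busy, which is the simplification the argument above relies on; `wave_stats` is a hypothetical helper.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def wave_stats(num_tiles: int, num_units: int) -> dict:
    """Wave count, tail-wave occupancy, and overall efficiency.

    Assumes num_tiles >= 1 and that each wave takes equal time.
    """
    waves = ceil_div(num_tiles, num_units)
    tail_tiles = num_tiles - (waves - 1) * num_units
    return {
        "waves": waves,
        "tail_occupancy": tail_tiles / num_units,
        "efficiency": num_tiles / (waves * num_units),
    }


print(wave_stats(108, 108))  # one full wave
print(wave_stats(109, 108))  # one extra tile forces a second wave
```

With 109 tiles the efficiency is 109/216, about 0.50, which matches the "roughly halve" observation.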
5.3 Shape Mismatch (aspect ratio mismatch)
When layer aspect ratio (M:N) differs from array aspect ratio (R:C), one dimension will have worse quantization loss than the other. SCALE-Sim v1 analyzed this explicitly: varying array aspect ratios on fixed workloads shows substantial utilization variation.
The paper states that "bandwidth, dataflow, and aspect ratio" are the three primary axes for exploring accelerator efficiency.
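An aspect-ratio sweep of this kind can be approximated with the single-core cycle formula from §3.2. The sketch below is not the experiment from the paper: the workload (Sr = 64, Sc = 512, T = 1024) and the array shapes are made-up values, and `aspect_ratio_sweep` is a hypothetical helper.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def aspect_ratio_sweep(Sr: int, Sc: int, T: int, shapes) -> dict:
    """Utilization of a fixed workload on several array shapes (single core)."""
    results = {}
    for R, C in shapes:
        cycles = (2 * R + C + T - 2) * ceil_div(Sr, R) * ceil_div(Sc, C)
        results[(R, C)] = (Sr * Sc * T) / (R * C * cycles)
    return results


# Five shapes with the same PE count (16384) but different aspect ratios.
shapes = [(32, 512), (64, 256), (128, 128), (256, 64), (512, 32)]
for shape, util in aspect_ratio_sweep(64, 512, 1024, shapes).items():
    print(shape, round(util, 3))
```

For this short, wide workload the 64x256 array comes out ahead of the square 128x128 array, since half of the square array's rows sit idle on every pass.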
5.4 Pipeline Startup Overhead (fill/drain tax)
Even for perfectly divisible shapes, every tile incurs (R + C - 2) fill cycles and R drain cycles during which the pipeline is not fully utilized. For small T (short accumulation depths), this overhead fraction is large:
overhead_fraction = (2*R + C - 2) / (2*R + C + T - 2)
Under this model, workloads with large T (deep reductions) reach higher utilization than thin, wide GEMMs, whose short accumulation depth leaves fill and drain cycles dominant.
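The effect of T on the fill/drain tax can be seen by evaluating `overhead_fraction` for a fixed array. A minimal sketch; the 128x128 array and the T values are arbitrary examples.

```python
def overhead_fraction(R: int, C: int, T: int) -> float:
    """Share of a tile's cycles spent on pipeline fill and drain."""
    return (2 * R + C - 2) / (2 * R + C + T - 2)


# 128x128 array: fill plus drain is 382 cycles regardless of T.
for T in (64, 256, 1024, 4096):
    print(T, round(overhead_fraction(128, 128, T), 3))
```

At T = 64 the overhead is 382/446 (about 86% of the tile), while at T = 4096 it falls to 382/4478 (under 9%).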
6. Accuracy Claims
SCALE-Sim v3: "100% accuracy compared to Sparse Tensor Core's report" for Ampere 2:4 sparsity patterns; "≤5% error" versus RTL validation for VEGETA flexible N:M sparsity accelerator.
SCALE-Sim v2: "Analytical compute cycles validated by RTL simulation" (per GitHub README, with specific numbers in cited ISPASS papers).
SCALE-Sim v1 (2018): Case studies on ResNet/AlexNet layers; validation against known accelerator configurations; quantitative claims on bandwidth vs cycle tradeoffs but not a specific accuracy % vs silicon.
SCALE-Sim is explicitly cycle-approximate, not cycle-exact. Its power comes from analytical formulas (no event simulation needed for the compute core), enabling rapid design-space exploration over thousands of configurations in seconds.
7. Key Differences Between v1, v2, v3
| Feature | v1 (2018) | v2 (~2021) | v3 (2025) |
|---|---|---|---|
| Dataflows | OS/WS/IS | OS/WS/IS | OS/WS/IS + spatio-temporal partitioning |
| Memory model | Ideal SRAM | Double-buffered SRAM | Ramulator DRAM integration |
| Sparsity | No | No | 2:4 and N:M structured sparsity |
| Multi-core | No | No | Yes (spatio-temporal split) |
| Energy | No | No | Accelergy-based |
Citations
- SCALE-Sim v1 (original): Samajdar et al., "SCALE-Sim: Systolic CNN Accelerator Simulator," arXiv 1811.02883, 2018. https://arxiv.org/abs/1811.02883
- SCALE-Sim v3: "SCALE-Sim v3: A modular cycle-accurate systolic accelerator simulator for end-to-end system analysis," arXiv 2504.15377, 2025. https://arxiv.org/html/2504.15377v1
- SCALE-Sim project / v2 repo: scalesim-project/scale-sim-v2, GitHub. https://github.com/scalesim-project/scale-sim-v2
- NVIDIA Matrix Multiplication Background: NVIDIA DL Performance Guide, tile quantization and wave quantization definitions with A100 empirical data. https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
- Systolic Array Dataflows for Efficient MatMul: "Systolic Array Data Flows for Efficient Matrix Multiplication in Deep Neural Networks," arXiv 2410.22595, 2024. (Contains Eq. NC = 2*SR + SC + T - 2 and the WS/IS/OS dimension mapping table.) https://arxiv.org/html/2410.22595v1