ANE runs SD1.5 UNet at 6-7x lower energy

If you run Stable Diffusion 1.5 on Apple Silicon in a loop, on anything that bills you by the watt-hour, the question that matters is not which backend is fastest. They all land within about a third of each other on latency (382-513 ms per UNet step). The question is which one spends the least energy doing it, and the answer is lopsided: on an M2 Pro the Neural Engine draws 6 to 7x less energy per UNet step than the same model on the GPU or MPS, at the same or slightly better wall-clock speed. The catch is numerical, not temporal, and it is small enough that most generation pipelines will not care.

The harness is coreml-diffusion-benchmarks — every number below comes out of it, and the run is reproducible from the commit in the panel.

Context

The cost that compounds in a 24/7 diffusion pipeline is energy per denoising step, not per-image latency. A 50-step image repeats the UNet 50 times; everything else (VAE, CLIP) runs once. So the UNet step is the thing worth measuring in isolation, and it is the only thing measured here — no VAE, no text encoder in any timed path.

The matchup is four backends running the same SD1.5 UNet weights on the same M2 Pro: Apple’s ml-stable-diffusion (coremltools 8), my own coreml-diffusion pipeline (coremltools 9), diffusers on MPS, and MLX. Seven cells in total, fp16 across the board plus one 4-bit palettized Core ML cell.

What is compared

Every backend loads the same SD1.5 weights; the only intended variables are the runtime and, for the two Core ML paths, the coremltools version.

Source model. Stable Diffusion 1.5, v1-5-pruned-emaonly.safetensors (Hugging Face stable-diffusion-v1-5/stable-diffusion-v1-5), SHA-256 pinned. One checkpoint, every backend; a mismatch is fatal, not silently benchmarked.
Apple ct8. apple/ml-stable-diffusion via python-coreml-stable-diffusion 1.1.0, coremltools 8.3.0 — the historical Core ML baseline.
Ours ct9. coreml-diffusion 0.1.0 (also on PyPI), coremltools 9.0. Same conversion method as Apple’s path; the toolchain version is the single intended difference, which makes the ct8-vs-ct9 contrast clean.
diffusers on MPS. huggingface/diffusers 0.32.2: the GPU reference path. The same library, run in fp32 on CPU, also supplies the reference for the equivalence check.
MLX. ml-explore/mlx-examples (mlx 0.31.2). Upstream ships SD2.1-base, SDXL, and Flux but not SD1.5, so the harness adapter supplies the SD1.5 UNet config.
Harness. coreml-diffusion-benchmarks 0.1.0 — full provenance in the Reproduce panel below.

Method

The whole point is a fair comparison, so the harness pins everything that can move:

One checkpoint, one input. All four backends load the same SD1.5 weights (SHA-pinned) and run the same input tensor — a fixed latent [2,4,64,64] and text-embedding [2,77,768], both SHA-pinned, seed 0. Same numbers in, so any difference out is the backend, not the data.
UNet only. One forward pass at timestep 500 (mid-schedule; the UNet’s compute cost is timestep-independent, so the choice does not bias latency or energy). No scheduler loop, no VAE.
Timing. One warmup pass discarded, then 10 timed iterations; median + IQR per run.
Power. powermetrics per-engine channels (gpu_power, ane_power) at 100 ms, baseline-subtracted against a 2 s idle window. Relative numbers only — this measures the delta the workload adds, not absolute board draw. The harness refuses to record power unless the host is on AC, low-power mode is off, and loadavg_1m ≤ 2.0, so a noisy machine cannot quietly pollute the baseline.
Repetition. Each cell runs 7 independent times with a 30 s cooldown between passes; the reported figure is the median across runs with a p10-p90 spread. (The “why 7 runs and not 1000 iterations” story is in What broke — it is the load-bearing methodology decision in this whole post.)
AC power, fixed, throughout. Battery changes the power-management regime.
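
The power bullet above reduces to a short calculation. A minimal sketch, assuming power samples arrive as milliwatt readings at a fixed interval; the function name, argument layout, and sample values are hypothetical and are not taken from the harness:

```python
from statistics import mean


def energy_per_step_j(load_mw, idle_mw, interval_s, steps):
    """Baseline-subtracted energy per UNet step, in joules.

    load_mw    -- power samples (mW) taken while the timed iterations run
    idle_mw    -- power samples (mW) from the idle baseline window
    interval_s -- sampling interval in seconds (0.1 for 100 ms)
    steps      -- number of UNet forward passes covered by load_mw
    """
    baseline_mw = mean(idle_mw)
    # Energy the workload adds: sum of (sample - baseline) * dt, then mW -> W.
    added_j = sum(p - baseline_mw for p in load_mw) * interval_s / 1000.0
    return added_j / steps


# Hypothetical readings: 2 s of idle at 50 mW, then 4 s of load at 4050 mW.
idle = [50.0] * 20
load = [4050.0] * 40
print(energy_per_step_j(load, idle, interval_s=0.1, steps=10))
```

Because the baseline is subtracted before integrating, the result is the delta the workload adds, which is why the figures in this post are relative rather than absolute board draw.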

The matrix is declarative; adding or removing a cell is a config edit, not a code change. An abbreviated cell plus the power and equivalence blocks (the full file has seven cells):

- id: ours-ane-w4
  label: "Ours ct9 · ANE · split-einsum-v2 · 4-bit palettized"
  backend: coreml_diffusion
  compute_unit: CPU_AND_NE
  attention: SPLIT_EINSUM_V2
  precision: w4
  resolution: 512
  enabled: true

power:                  # per-engine, baseline-subtracted, relative-only
  interval_ms: 100      # fine enough to resolve a single ~200-400 ms UNet step
  baseline_seconds: 2
  samplers: [cpu_power, gpu_power, ane_power]

equivalence:            # MSE + cosine vs reference; flags, never drops
  reference:            # ground truth = diffusers UNet, fp32, on CPU
    backend: diffusers_mps
    device: cpu
    precision: fp32
  mse_max: 1.0e-3
  cosine_min: 0.999

Numbers

Median across 7 runs, p10-p90 in brackets. Latency in ms, energy in joules per UNet step. The last column is a linear 50x extrapolation of the single-step energy, not a measured 50-step run — it excludes VAE, CLIP, and scheduler overhead, and assumes no thermal drift across the image. (The harness labels this the same way: estimated_energy_per_50_step_image_j, an extrapolation, not a measured image.)

Cell                       Latency (ms)        Energy/step (J)     ~Energy/image (J)
Apple ct8 · ANE · fp16     402.0 [401.7-402.2] 1.622 [1.617-1.692]  81.1
Ours  ct9 · ANE · fp16     413.1 [412.5-413.5] 1.654 [1.653-1.697]  82.7
Ours  ct9 · ANE · w4       382.1 [381.8-382.3] 1.499 [1.454-1.539]  75.0
Apple ct8 · GPU · fp16     443.8 [443.1-444.1] 9.755 [9.627-9.878] 487.8
Ours  ct9 · GPU · fp16     494.5 [492.7-494.9] 10.53 [10.37-10.63] 526.5
diffusers · MPS · fp16     513.3 [512.9-513.9] 10.55 [10.36-10.62] 527.3
MLX   · GPU · fp16         496.0 [495.4-496.3] 9.956 [8.867-10.05] 497.8

Per-backend UNet-step latency vs energy per image, M2 Pro, n=7 medians. Same horizontal band, two vertical clusters: the ANE cells sit ~6-7x lower on energy at the same speed.

Put concretely against the GPU camp (9.76-10.55 J/step): Apple’s ANE fp16 cell lands at 6.0-6.5x lower energy, and the 4-bit ANE cell at 6.5-7.0x (the ranges share an edge because they divide the same GPU spread by two different ANE cells). Both ends of the “6-7x” are real measured cells, not a rounded headline.
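
The ratio ranges can be recomputed directly from the table medians. A small sketch, with the dictionary keys chosen here for readability (they are labels, not harness cell ids):

```python
# Median energy per UNet step (J), copied from the Numbers table.
ane_cells = {"Apple ct8 ANE fp16": 1.622, "Ours ct9 ANE w4": 1.499}
gpu_low, gpu_high = 9.755, 10.55  # lowest and highest GPU/MPS medians


def ratio_range(ane_j):
    """Return (low, high): GPU-camp energy divided by one ANE cell's energy."""
    return gpu_low / ane_j, gpu_high / ane_j


for label, joules in ane_cells.items():
    low, high = ratio_range(joules)
    print(f"{label}: {low:.2f}x to {high:.2f}x")
```

The fp16 cell comes out at roughly 6.0x to 6.5x and the w4 cell at roughly 6.5x to 7.0x, matching the ranges quoted above.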

The w4 cell is the standout: it is simultaneously the lowest energy (1.499 J), the lowest latency (382 ms), and a quarter of the on-disk size (430 MB weights vs 1.72 GB for fp16). Palettizing to 4 bits costs nothing on speed or energy here — it only costs accuracy, which is the next section.

The cost: numerical divergence

The energy win is not free. Against an fp32 CPU reference, the GPU/MPS cells are numerically near-identical (cosine = 1.00000, MSE ~1e-6 to 3e-6). The ANE cells are not:

Cell                     cosine vs fp32 ref    MSE          numerically_divergent
Apple ct8 · ANE · fp16   0.99690               5.65e-03     yes
Ours  ct9 · ANE · fp16   0.99689               5.66e-03     yes
Ours  ct9 · ANE · w4     0.99429               1.11e-02     yes
GPU / MPS (all fp16)     1.00000               ≤3e-6        no

The ANE path uses SPLIT_EINSUM_V2 attention and the Neural Engine’s own fp16 arithmetic, and it lands 0.31% off the reference in cosine terms; w4 widens that to 0.57%. The harness flags this rather than dropping the cell — divergence is a property to report, not a failure to hide.
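
The two metrics in the table are simple to state in code. A minimal sketch in plain Python over flattened outputs, using the thresholds from the equivalence config block; the function name is hypothetical, and treating a violation of either threshold as divergent is an assumption about the flagging rule, not something confirmed from the harness source:

```python
import math

MSE_MAX = 1.0e-3     # mse_max from the equivalence config
COSINE_MIN = 0.999   # cosine_min from the equivalence config


def equivalence(output, reference):
    """Return (mse, cosine, divergent) for two flat, equal-length sequences."""
    if len(output) != len(reference):
        raise ValueError("output and reference must have the same length")
    n = len(output)
    mse = sum((a - b) ** 2 for a, b in zip(output, reference)) / n
    dot = sum(a * b for a, b in zip(output, reference))
    norm = math.sqrt(sum(a * a for a in output)) * math.sqrt(sum(b * b for b in reference))
    cosine = dot / norm
    # Assumption: flag the cell when either threshold is violated.
    divergent = mse > MSE_MAX or cosine < COSINE_MIN
    return mse, cosine, divergent


# Toy vectors standing in for flattened UNet outputs.
reference = [1.0, 2.0, 3.0, 4.0]
print(equivalence(reference, reference))
print(equivalence([1.1, 1.9, 3.2, 3.8], reference))
```

The function returns a flag instead of raising, mirroring the "flags, never drops" behaviour described in the config comment.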

Whether that matters is a pipeline question, not a benchmark question. For most SD1.5 generation work the output is an image judged by eye, and a cosine of 0.994 on the UNet output is invisible.

What broke

Two single runs of the same ours-ane-fp16 cell came back at 1.16 J and 1.69 J per step — a 45% swing — while latency between the same two runs stayed within ~1 ms. Other cells swung up to 82%. Single-run energy was simply not reproducible, and worse, the swing was large enough to reverse conclusions: in one run my ct9 pipeline looked more efficient than Apple’s ct8 baseline; in the next, less.

The instinct is to throw iterations at it — go from 10 to 1000 and let the law of large numbers smooth it out. That is the wrong fix, for two reasons. First, the variance is not within-run sampling noise; a 45-82% swing is far too large to be the standard error of ~40 power samples over a stationary signal. It is between-run non-stationarity — background processes drifting in and out of the measurement window, the system state differing from run to run. Averaging harder inside one contaminated run just gives you a smoother wrong number. Second, 1000 iterations means ~400 s of sustained load, far past the thermal envelope these short cells stay inside. All seven passes here reported throttled=false with a 30 s cooldown between them; a single ~400 s run is a different thermal regime, and a throttled UNet is not the UNet I set out to measure. Short repeated runs keep the chip in the cold regime by construction.

The fix is structural: run the cell N independent times, report the median and the spread across runs. At n=7 the energy spread collapsed to 1-5% on most cells. And crucially, the multi-run view catches the contamination instead of hiding it. MLX is the live example here — seven runs of mlx-gpu-fp16 energy:

9.956  9.836  10.063  10.045  9.908  10.004  7.413

mlx-gpu-fp16 energy across 7 runs; dashed line is the median (9.956 J), the run at 7.413 J flagged as the outlier. A single run landing there would have reported it as truth.

Six runs cluster tightly around 10 J; one came back at 7.41. The median (9.956) ignores the outlier, and the wide p10 (8.867) is the tell that one run misbehaved. A single run that happened to land on 7.41 would have reported it as truth. The spread is not noise to be eliminated — it is the honesty of the measurement.
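
The summary statistics for those seven runs can be reproduced in a few lines. A sketch that assumes percentiles are linearly interpolated between ranks; the post does not state the method, but this choice reproduces the reported figures:

```python
from statistics import median


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# mlx-gpu-fp16 energy per step (J) across the 7 runs listed above.
runs = [9.956, 9.836, 10.063, 10.045, 9.908, 10.004, 7.413]
print(f"median={median(runs):.3f} "
      f"p10={percentile(runs, 10):.3f} "
      f"p90={percentile(runs, 90):.3f}")
# prints: median=9.956 p10=8.867 p90=10.052
```

The single low run barely moves the median but drags p10 down by more than a joule, which is exactly the signal the spread is there to surface.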

Baseline: coremltools 8 vs 9

A secondary result, mostly of interest to me as the author of the ct9 converter. Swapping the toolchain (coremltools 8 -> 9, with a modern torch/numpy stack) and nothing else, on the matching ANE fp16 cells:

Latency: ct9 is ~2.8% slower (413.1 ms vs 402.0 ms). The p10-p90 intervals are tight and disjoint, so this difference is real, not noise.
Energy: indistinguishable. Apple ct8 at 1.622 J [1.617-1.692], ours ct9 at 1.654 J [1.653-1.697] — the intervals overlap. There is no energy regression and no energy win from the toolchain change. A negative result, and a clean one.
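
The "disjoint versus overlapping" reading of the two bullets above can be checked mechanically. A small sketch using the p10-p90 intervals from the Numbers table; comparing intervals this way is a rough screen, not a formal significance test, and the helper name is hypothetical:

```python
def overlaps(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# p10-p90 intervals for the two ANE fp16 cells, from the Numbers table.
latency_ms = {"ct8": (401.7, 402.2), "ct9": (412.5, 413.5)}
energy_j = {"ct8": (1.617, 1.692), "ct9": (1.653, 1.697)}

print(overlaps(latency_ms["ct8"], latency_ms["ct9"]))  # False: disjoint
print(overlaps(energy_j["ct8"], energy_j["ct9"]))      # True: overlapping
```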

For a converter, that is the reassuring read: moving to coremltools 9 costs a few percent of latency on this workload and changes nothing about its energy profile. The headroom, if any, is in the conversion path, not the toolchain version.

Takeaway

For an energy-bound SD1.5 pipeline on Apple Silicon, the decision is clear: run the UNet on the ANE. Fp16 buys a 6.0-6.5x energy reduction over GPU/MPS at equal or better speed; 4-bit palettization pushes that to 6.5-7.0x and a quarter of the disk footprint, if the 0.57% numerical divergence is acceptable for your use. For image generation it usually is.

Two lessons earned the hard way. Never trust a single power measurement; report energy with a spread or do not report it. And coremltools 9 is a safe move on this workload (about 2.8% more latency, no measurable energy change), so the toolchain version is not where the optimization budget should go.

The harness is coreml-diffusion-benchmarks (PyPI release pending; for now uvx --from git+... as in the panel). The converter under the ct9 cells is coreml-diffusion.
