If you run Stable Diffusion 1.5 on Apple Silicon in a loop, or anywhere else that bills you by the watt, the question that matters is not which backend is fastest. Latencies sit in a narrow band, 382 to 513 ms per UNet step. The question is which backend spends the least energy per step, and the answer is lopsided: on an M2 Pro the Neural Engine draws 6 to 7x less energy per UNet step than the same model on the GPU (Core ML, MPS, or MLX), at equal or slightly better wall-clock speed. The catch is numerical, not temporal, and it is small enough that most generation pipelines will not care.
The harness is coreml-diffusion-benchmarks — every number below comes out of it,
and the run is reproducible from the commit in the panel.
Context
The cost that compounds in a 24/7 diffusion pipeline is energy per denoising step, not per-image latency. A 50-step image repeats the UNet 50 times; everything else (VAE, CLIP) runs once. So the UNet step is the thing worth measuring in isolation, and it is the only thing measured here — no VAE, no text encoder in any timed path.
The matchup is four backends running the same SD1.5 UNet weights on the same M2
Pro: Apple’s ml-stable-diffusion (coremltools 8), my own coreml-diffusion
pipeline (coremltools 9), diffusers on MPS, and MLX. Seven cells in total, fp16
across the board plus one 4-bit palettized Core ML cell.
What is compared
Every backend loads the same SD1.5 weights; the only intended variables are the runtime and, for the two Core ML paths, the coremltools version.
- Source model. Stable Diffusion 1.5, v1-5-pruned-emaonly.safetensors (Hugging Face stable-diffusion-v1-5/stable-diffusion-v1-5), SHA-256 pinned. One checkpoint, every backend; a mismatch is fatal, not silently benchmarked.
- Apple ct8. apple/ml-stable-diffusion via python-coreml-stable-diffusion 1.1.0, coremltools 8.3.0: the historical Core ML baseline.
- Ours ct9. coreml-diffusion 0.1.0 (also on PyPI), coremltools 9.0. Same conversion method as Apple’s path; the toolchain version is the single intended difference, which makes the ct8-vs-ct9 contrast clean.
- diffusers on MPS. huggingface/diffusers 0.32.2: the GPU reference path, and the fp32-on-CPU equivalence reference.
- MLX. ml-explore/mlx-examples (mlx 0.31.2). Upstream ships SD2.1-base, SDXL, and Flux but not SD1.5, so the harness adapter supplies the SD1.5 UNet config.
- Harness. coreml-diffusion-benchmarks 0.1.0: full provenance in the Reproduce panel below.
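The "a mismatch is fatal" rule for the pinned checkpoint can be sketched in a few lines. This is a minimal illustration, not the harness's code: the function names `sha256_of` and `require_pinned` are hypothetical, and the expected digest is whatever value the pin records.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so a multi-GB checkpoint never sits in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def require_pinned(path: Path, expected_sha256: str) -> None:
    """Hypothetical gate: raise on mismatch so a wrong checkpoint is never benchmarked."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(f"{path.name}: SHA-256 {actual} does not match the pin")
```

Raising instead of logging a warning is the design point: a run against the wrong weights should produce no numbers at all.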
Method
The whole point is a fair comparison, so the harness pins everything that can move:
- One checkpoint, one input. All four backends load the same SD1.5 weights (SHA-pinned) and run the same input tensor: a fixed latent [2,4,64,64] and text-embedding [2,77,768], both SHA-pinned, seed 0. Same numbers in, so any difference out is the backend, not the data.
- UNet only. One forward pass at timestep 500 (mid-schedule; the UNet’s compute cost is timestep-independent, so the choice does not bias latency or energy). No scheduler loop, no VAE.
- Timing. One warmup pass discarded, then 10 timed iterations; median + IQR per run.
- Power. powermetrics per-engine channels (gpu_power, ane_power) at 100 ms, baseline-subtracted against a 2 s idle window. Relative numbers only: this measures the delta the workload adds, not absolute board draw. The harness refuses to record power unless the host is on AC, low-power mode is off, and loadavg_1m ≤ 2.0, so a noisy machine cannot quietly pollute the baseline.
- Repetition. Each cell runs 7 independent times with a 30 s cooldown between passes; the reported figure is the median across runs with a p10-p90 spread. (The “why 7 runs and not 1000 iterations” story is in What broke; it is the load-bearing methodology decision in this whole post.)
- AC power, fixed, throughout. Battery changes the power-management regime.
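The timing rule above (one discarded warmup, 10 timed iterations, median + IQR) might look like the following sketch. The name `time_step` is hypothetical, and `step` is assumed to block until the device has finished; on an asynchronous GPU backend that means synchronizing inside `step`.

```python
import statistics
import time
from typing import Callable


def time_step(step: Callable[[], object], warmup: int = 1, iters: int = 10) -> dict:
    """Time one UNet forward pass; `step` must block until the result is ready."""
    for _ in range(warmup):
        step()                                  # discarded: first-call compile and cache effects
    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    # Quartiles with the inclusive method; IQR is the spread of the middle half.
    q1, _, q3 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    return {
        "median_ms": statistics.median(samples_ms),
        "iqr_ms": q3 - q1,
        "n": len(samples_ms),
    }
```

The median and IQR are used here because both ignore a single slow iteration, which a mean and standard deviation would not.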
The matrix is declarative; adding or removing a cell is a config edit, not a code change. An abbreviated cell plus the power and equivalence blocks (the full file has seven cells):
    - id: ours-ane-w4
      label: "Ours ct9 · ANE · split-einsum-v2 · 4-bit palettized"
      backend: coreml_diffusion
      compute_unit: CPU_AND_NE
      attention: SPLIT_EINSUM_V2
      precision: w4
      resolution: 512
      enabled: true

    power:                 # per-engine, baseline-subtracted, relative-only
      interval_ms: 100     # fine enough to resolve a single ~200-400 ms UNet step
      baseline_seconds: 2
      samplers: [cpu_power, gpu_power, ane_power]

    equivalence:           # MSE + cosine vs reference; flags, never drops
      reference:           # ground truth = diffusers UNet, fp32, on CPU
        backend: diffusers_mps
        device: cpu
        precision: fp32
      mse_max: 1.0e-3
      cosine_min: 0.999
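To make the power block concrete, here is a minimal sketch of how baseline-subtracted energy could be derived from a stream of samples on one channel. It assumes samples arrive in milliwatts at a fixed interval and that the baseline is the mean of the idle window; the function name and the sample values are illustrative, not taken from the harness or from a real run.

```python
from statistics import fmean
from typing import Sequence


def added_energy_j(load_mw: Sequence[float], idle_mw: Sequence[float],
                   interval_ms: float = 100.0) -> float:
    """Energy the workload adds on one channel, in joules (relative, not board draw)."""
    baseline_mw = fmean(idle_mw)        # mean draw over the idle window
    dt_s = interval_ms / 1000.0         # each sample stands for one interval
    # Subtract the baseline per sample, convert mW to W, then integrate over time.
    return sum(p - baseline_mw for p in load_mw) / 1000.0 * dt_s


# Illustrative numbers only, not measurements:
idle = [50.0] * 20       # 2 s idle window at 100 ms
load = [4050.0] * 40     # 4 s under load, e.g. 10 steps of ~400 ms
per_step_j = added_energy_j(load, idle) / 10
print(round(per_step_j, 3))   # 1.6
```

With only about 40 samples per run, one background burst inside the window shifts the total noticeably, which is why the run-to-run spread matters more than the within-run average.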
Numbers
Median across 7 runs, p10-p90 in brackets. Latency in ms, energy in joules per
UNet step. The last column is a linear 50x extrapolation of the single-step
energy, not a measured 50-step run — it excludes VAE, CLIP, and scheduler
overhead, and assumes no thermal drift across the image. (The harness labels this
the same way: estimated_energy_per_50_step_image_j, an extrapolation, not a
measured image.)
| Cell | Latency (ms) | Energy/step (J) | ~Energy/image (J) |
|---|---|---|---|
| Apple ct8 · ANE · fp16 | 402.0 [401.7-402.2] | 1.622 [1.617-1.692] | 81.1 |
| Ours ct9 · ANE · fp16 | 413.1 [412.5-413.5] | 1.654 [1.653-1.697] | 82.7 |
| Ours ct9 · ANE · w4 | 382.1 [381.8-382.3] | 1.499 [1.454-1.539] | 75.0 |
| Apple ct8 · GPU · fp16 | 443.8 [443.1-444.1] | 9.755 [9.627-9.878] | 487.8 |
| Ours ct9 · GPU · fp16 | 494.5 [492.7-494.9] | 10.53 [10.37-10.63] | 526.5 |
| diffusers · MPS · fp16 | 513.3 [512.9-513.9] | 10.55 [10.36-10.62] | 527.3 |
| MLX · GPU · fp16 | 496.0 [495.4-496.3] | 9.956 [8.867-10.05] | 497.8 |
Put concretely against the GPU camp (9.76-10.55 J/step): Apple’s ANE fp16 cell lands at 6.0-6.5x lower energy, and the 4-bit ANE cell at 6.5-7.0x (the ranges share an edge because they divide the same GPU spread by two different ANE cells). Both ends of the “6-7x” are real measured cells, not a rounded headline.
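The ratios can be re-derived from the table medians. The sketch below divides the lowest and highest GPU medians by each ANE median; `reduction_range` is a hypothetical helper, and the inputs are the rounded table values.

```python
def reduction_range(ane_j: float, gpu_low_j: float, gpu_high_j: float) -> tuple:
    """Energy-reduction factor of an ANE cell against both ends of the GPU spread."""
    return gpu_low_j / ane_j, gpu_high_j / ane_j


# Median J/step from the table: lowest and highest GPU-side cells.
GPU_LOW_J, GPU_HIGH_J = 9.755, 10.55

for label, ane_j in [("Apple ct8 ANE fp16", 1.622), ("Ours ct9 ANE w4", 1.499)]:
    lo, hi = reduction_range(ane_j, GPU_LOW_J, GPU_HIGH_J)
    print(f"{label}: {lo:.2f}x to {hi:.2f}x")
# Apple ct8 ANE fp16: 6.01x to 6.50x
# Ours ct9 ANE w4: 6.51x to 7.04x
```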
The w4 cell is the standout: it is simultaneously the lowest energy (1.499 J), the lowest latency (382 ms), and a quarter of the on-disk size (430 MB weights vs 1.72 GB for fp16). Palettizing to 4 bits costs nothing on speed or energy here — it only costs accuracy, which is the next section.
The cost: numerical divergence
The energy win is not free. Against an fp32 CPU reference, the GPU/MPS cells are numerically near-identical (cosine = 1.00000, MSE ~1e-6 to 3e-6). The ANE cells are not:
| Cell | cosine vs fp32 ref | MSE | numerically_divergent |
|---|---|---|---|
| Apple ct8 · ANE · fp16 | 0.99690 | 5.65e-03 | yes |
| Ours ct9 · ANE · fp16 | 0.99689 | 5.66e-03 | yes |
| Ours ct9 · ANE · w4 | 0.99429 | 1.11e-02 | yes |
| GPU / MPS (all fp16) | 1.00000 | ≤3e-6 | no |
The ANE path uses SPLIT_EINSUM_V2 attention and the Neural Engine’s own fp16
arithmetic, and its cosine similarity to the reference falls 0.31% short of 1;
w4 widens that gap to 0.57%. Both sit outside the configured thresholds
(mse_max 1.0e-3, cosine_min 0.999). The harness flags this rather than dropping
the cell: divergence is a property to report, not a failure to hide.
Whether that matters is a pipeline question, not a benchmark question. For most SD1.5 generation work the output is an image judged by eye, and a cosine of 0.994 on the UNet output is invisible.
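For reference, the two metrics in the table are simple to compute on flattened UNet outputs. The sketch below is pure Python for clarity and makes one assumption the post does not state: that a cell is flagged when either threshold is missed. The function name `equivalence` is hypothetical; the default thresholds are the ones from the config.

```python
import math
from typing import Sequence


def equivalence(candidate: Sequence[float], reference: Sequence[float],
                mse_max: float = 1.0e-3, cosine_min: float = 0.999) -> dict:
    """MSE and cosine similarity of a flattened output against the fp32 reference."""
    if len(candidate) != len(reference) or not reference:
        raise ValueError("outputs must be non-empty and the same length")
    mse = sum((c - r) ** 2 for c, r in zip(candidate, reference)) / len(reference)
    dot = sum(c * r for c, r in zip(candidate, reference))
    norm = math.sqrt(sum(c * c for c in candidate)) * math.sqrt(sum(r * r for r in reference))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero output")
    cosine = dot / norm
    # Assumption: missing either threshold flags the cell; the cell is kept either way.
    divergent = mse > mse_max or cosine < cosine_min
    return {"mse": mse, "cosine": cosine, "numerically_divergent": divergent}
```

Plugging in the table values, the ANE cells miss both thresholds (cosine 0.9969 < 0.999, MSE 5.65e-03 > 1.0e-3), while the GPU cells pass both.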
What broke
Two single runs of the same ours-ane-fp16 cell came back at 1.16 J and 1.69 J
per step — a 45% swing — while latency between the same two runs stayed within
~1 ms. Other cells swung up to 82%. Single-run energy was simply not reproducible,
and worse, the swing was large enough to reverse conclusions: in one run my ct9
pipeline looked more efficient than Apple’s ct8 baseline; in the next, less.
The instinct is to throw iterations at it — go from 10 to 1000 and let the law of
large numbers smooth it out. That is the wrong fix, for two reasons. First, the
variance is not within-run sampling noise; a 45-82% swing is far too large to be
the standard error of ~40 power samples over a stationary signal. It is
between-run non-stationarity — background processes drifting in and out of the
measurement window, the system state differing from run to run. Averaging harder
inside one contaminated run just gives you a smoother wrong number. Second, 1000
iterations means ~400 s of sustained load, far past the thermal envelope these
short cells stay inside. All seven passes here reported throttled=false with a
30 s cooldown between them; a single ~400 s run is a different thermal regime, and
a throttled UNet is not the UNet I set out to measure. Short repeated runs keep the
chip in the cold regime by construction.
The fix is structural: run the cell N independent times, report the median and the
spread across runs. At n=7 the energy spread collapsed to 1-5% on most cells. And
crucially, the multi-run view catches the contamination instead of hiding it.
MLX is the live example here — seven runs of mlx-gpu-fp16 energy:
9.956 9.836 10.063 10.045 9.908 10.004 7.413
Six runs cluster tightly around 10 J; one came back at 7.41. The median (9.956) ignores the outlier, and the wide p10 (8.867) is the tell that one run misbehaved. A single run that happened to land on 7.41 would have reported it as truth. The spread is not noise to be eliminated — it is the honesty of the measurement.
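The median and p10-p90 for the MLX cell can be reproduced from the seven values above. The sketch assumes linear interpolation between order statistics (the same rule NumPy's percentile uses by default); the post does not say which method the harness uses, but this one reproduces the table figures.

```python
import statistics
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = q / 100 * (len(ordered) - 1)
    lo = int(rank)                          # floor, since rank is non-negative
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


mlx_runs_j = [9.956, 9.836, 10.063, 10.045, 9.908, 10.004, 7.413]
median = statistics.median(mlx_runs_j)
p10, p90 = percentile(mlx_runs_j, 10), percentile(mlx_runs_j, 90)
print(f"{median:.3f} [{p10:.3f}-{p90:.3f}]")   # 9.956 [8.867-10.052]
```

With n=7, p10 falls between the lowest and second-lowest run, so a single contaminated run drags it visibly while leaving the median untouched.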
Baseline: coremltools 8 vs 9
A secondary result, mostly of interest to me as the author of the ct9 converter. Swapping the toolchain (coremltools 8 -> 9, with a modern torch/numpy stack) and nothing else, on the matching ANE fp16 cells:
- Latency: ct9 is ~2.8% slower (413.1 ms vs 402.0 ms). The p10-p90 intervals are tight and disjoint, so this difference is real, not noise.
- Energy: indistinguishable. Apple ct8 at 1.622 J [1.617-1.692], ours ct9 at 1.654 J [1.653-1.697] — the intervals overlap. There is no energy regression and no energy win from the toolchain change. A negative result, and a clean one.
For a converter, that is the reassuring read: moving to coremltools 9 costs a few percent of latency on this workload and changes nothing about its energy profile. The headroom, if any, is in the conversion path, not the toolchain version.
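The "real vs indistinguishable" verdicts above come down to whether two p10-p90 intervals overlap. A minimal sketch of that check, using the intervals from the table; treating disjoint intervals as a real difference is a rule of thumb, not a formal significance test.

```python
from typing import Tuple

Interval = Tuple[float, float]


def disjoint(a: Interval, b: Interval) -> bool:
    """True when two [p10, p90] intervals do not overlap at all."""
    return a[1] < b[0] or b[1] < a[0]


# ANE fp16 cells, ct8 vs ct9, p10-p90 from the table.
latency_differs = disjoint((401.7, 402.2), (412.5, 413.5))   # ms
energy_differs = disjoint((1.617, 1.692), (1.653, 1.697))    # J/step
print(latency_differs, energy_differs)   # True False
```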
Takeaway
For an energy-bound SD1.5 pipeline on Apple Silicon, the decision is clear: run the UNet on the ANE. Fp16 buys a roughly 6x energy reduction over GPU/MPS at equal or slightly better latency; 4-bit palettization pushes it to ~7x and a quarter of the disk footprint, if a cosine similarity 0.57% short of the fp32 reference is acceptable for your use, and for image generation it usually is.
Two caveats earned the hard way. Never trust a single power measurement; report energy with a spread or do not report it. And coremltools 9 is a safe move on this workload — ~2.8% latency, no energy change — so the toolchain version is not where the optimization budget should go.
The harness is coreml-diffusion-benchmarks (PyPI release pending; for now
uvx --from git+... as in the panel). The converter under the ct9 cells is
coreml-diffusion.