Apple-Silicon | log.aszc.dev

2026-06-08

Converting SD1.5 to Core ML: the UNet carries essentially all the fidelity loss

The energy post left one thread hanging. Running the SD1.5 UNet on the Neural Engine buys 6-7x lower energy at the same speed as GPU/MPS, but the SPLIT_EINSUM_V2 fp16 conversion lands 0.31% off an fp32 reference in cosine terms — “a small numerical catch.” That number is a tensor distance. It says the latents drift; it does not say whether a human looking at the decoded image would ever notice, or which of the three converted components is to blame.
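To make "a tensor distance" concrete, here is a minimal sketch of how such a figure could be computed. It assumes the 0.31% means one minus cosine similarity, expressed as a percentage, taken over the flattened latents; the post does not spell out the exact definition. The function name `cosine_distance_pct` is illustrative, and a real pipeline would more likely use NumPy or PyTorch than plain lists.

```python
import math
from typing import Sequence


def cosine_distance_pct(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Return (1 - cosine similarity) * 100 for two flattened tensors.

    Assumption: this is the metric behind the "0.31% off" figure;
    the source does not confirm the exact formula.
    """
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")

    # fsum keeps the accumulation error low for long vectors.
    dot = math.fsum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(math.fsum(r * r for r in reference))
    norm_cand = math.sqrt(math.fsum(c * c for c in candidate))

    if norm_ref == 0.0 or norm_cand == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")

    return (1.0 - dot / (norm_ref * norm_cand)) * 100.0
```

The metric ignores overall scale and collapses the whole tensor into one number, which is part of why it cannot say where in the image the drift lands or whether it is visible.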

2026-06-01

The ANE runs the SD1.5 UNet at 6-7x lower energy than GPU/MPS — at the same speed

If you run Stable Diffusion 1.5 on Apple Silicon in a loop, anywhere you effectively pay by the watt-hour, the question that matters is not which backend is fastest. They are all within a hair of each other on latency. The question is which one spends the least energy doing it, and the answer is lopsided: on an M2 Pro the Neural Engine draws 6 to 7x less energy per UNet step than the same model on the GPU or MPS, at the same wall-clock speed. The catch is numerical, not temporal, and it is small enough that most generation pipelines will not care.
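The arithmetic behind "less energy at the same speed" is simple enough to sketch: energy per step is average power multiplied by step time, so when latency is equal the energy ratio reduces to the power ratio. The helper names below are hypothetical and nothing here reproduces the post's measurements.

```python
def energy_per_step_joules(avg_power_watts: float, step_seconds: float) -> float:
    """Energy of one UNet step: average power (W) times duration (s)."""
    if avg_power_watts < 0 or step_seconds <= 0:
        raise ValueError("power must be non-negative and step time positive")
    return avg_power_watts * step_seconds


def energy_ratio(baseline_joules: float, candidate_joules: float) -> float:
    """How many times more energy the baseline spends than the candidate."""
    if candidate_joules <= 0:
        raise ValueError("candidate energy must be positive")
    return baseline_joules / candidate_joules
```

With equal step times, `energy_ratio(energy_per_step_joules(p_gpu, t), energy_per_step_joules(p_ane, t))` is just `p_gpu / p_ane`, which is why a latency benchmark alone cannot separate the backends.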