The ANE runs the SD1.5 UNet at 6-7x lower energy than GPU/MPS — at the same speed

If you run Stable Diffusion 1.5 on Apple Silicon in a loop, in any setting where the bill comes in joules rather than seconds, the question that matters is not which backend is fastest. They are all within a hair of each other on latency. The question is which one spends the least energy doing it, and the answer is lopsided: on an M2 Pro the Neural Engine uses 6 to 7x less energy per UNet step than the same model on the GPU or MPS, at the same wall-clock speed. The catch is numerical, not temporal, and it is small enough that most generation pipelines will not care.
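To make the energy-versus-speed distinction concrete, here is a minimal sketch of the arithmetic: energy per step is average power multiplied by step latency, so when latency is equal across backends the energy ratio reduces to the power ratio. The function names and the numbers in the usage lines are hypothetical placeholders chosen for illustration, not measurements from this article.

```python
def energy_per_step_joules(avg_power_watts: float, step_latency_s: float) -> float:
    """Energy for one UNet step: average power (W) times step duration (s)."""
    if avg_power_watts < 0 or step_latency_s < 0:
        raise ValueError("power and latency must be non-negative")
    return avg_power_watts * step_latency_s


def energy_ratio(power_a_watts: float, power_b_watts: float, step_latency_s: float) -> float:
    """How many times more energy backend A spends per step than backend B,
    assuming both finish a step in the same wall-clock time."""
    energy_a = energy_per_step_joules(power_a_watts, step_latency_s)
    energy_b = energy_per_step_joules(power_b_watts, step_latency_s)
    return energy_a / energy_b


# Placeholder inputs (NOT measured values): two backends with the same
# step latency but different average power draw.
latency_s = 0.5
gpu_power_w = 13.0
ane_power_w = 2.0

ratio = energy_ratio(gpu_power_w, ane_power_w, latency_s)
print(f"GPU spends {ratio:.1f}x the energy per step")
```

Because the latency term cancels, a 6 to 7x energy gap at equal speed means the same 6 to 7x gap in average power while the step runs.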