Converting SD1.5 to Core ML: the UNet carries essentially all the fidelity loss
The energy post left one thread hanging. Running the SD1.5 UNet on
the Neural Engine buys 6-7x lower energy at the same speed as GPU/MPS, but the
SPLIT_EINSUM_V2 fp16 conversion lands 0.31% off an fp32 reference, measured as
cosine distance: “a small numerical catch.” That number is a tensor distance. It says the
latents drift; it does not say whether a human looking at the decoded image would
ever notice, or which of the three converted components is to blame.
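
To make "a tensor distance" concrete, here is a minimal sketch of how a figure like 0.31% could be computed. It assumes that "off in cosine terms" means 100 × (1 − cosine similarity) between the flattened fp32 and fp16 output tensors. That definition and the helper name `cosine_distance_pct` are illustrative assumptions, not taken from the earlier post.

```python
import math
from typing import Sequence


def cosine_distance_pct(ref: Sequence[float], test: Sequence[float]) -> float:
    """Return 100 * (1 - cosine similarity) for two flattened tensors.

    Assumed definition of "X% off in cosine terms"; hypothetical helper.
    """
    if len(ref) != len(test):
        raise ValueError("tensors must have the same number of elements")

    # fsum keeps the accumulation error small on long vectors.
    dot = math.fsum(r * t for r, t in zip(ref, test))
    norm_ref = math.sqrt(math.fsum(r * r for r in ref))
    norm_test = math.sqrt(math.fsum(t * t for t in test))

    if norm_ref == 0.0 or norm_test == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")

    return 100.0 * (1.0 - dot / (norm_ref * norm_test))


if __name__ == "__main__":
    # Toy stand-ins for flattened latents; real inputs would be the fp32
    # reference output and the converted model's output on the same input.
    reference = [0.5, -1.25, 2.0, 0.75]
    converted = [0.52, -1.20, 1.95, 0.80]
    print(f"{cosine_distance_pct(reference, converted):.4f}% off")
```

The metric ignores overall scale and collapses the whole tensor into one scalar, which is why it cannot say where in the image the error lands or whether it is visible.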