Converting SD1.5 to Core ML: the UNet carries essentially all the fidelity loss

The energy post left one thread hanging. Running the SD1.5 UNet on the Neural Engine buys 6-7x lower energy at the same speed as GPU/MPS, but the SPLIT_EINSUM_V2 fp16 conversion lands 0.31% off an fp32 reference in cosine terms: “a small numerical catch.” That number is a tensor distance. It says the latents drift; it does not say whether a human looking at the decoded image would ever notice, or which of the three converted components is to blame.
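For concreteness, here is a minimal sketch of the kind of tensor distance that figure describes. It assumes "0.31% off in cosine terms" means one minus the cosine similarity of the flattened latents, expressed as a percentage; that reading is an assumption, and `cosine_distance_pct` is a hypothetical helper for illustration, not the code used for the measurement.

```python
import math
from typing import Sequence


def cosine_distance_pct(a: Sequence[float], b: Sequence[float]) -> float:
    """Return (1 - cosine similarity) * 100 for two flattened tensors.

    Assumption: this is one plausible definition of "X% off in cosine
    terms"; the measurement in the post may have been computed differently.
    """
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    # fsum keeps the accumulation accurate for long vectors.
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return (1.0 - dot / (norm_a * norm_b)) * 100.0


# Toy stand-ins for a flattened fp32 reference latent and a converted one.
reference = [0.5, -1.25, 2.0, 0.75]
converted = [0.5, -1.20, 2.05, 0.70]
print(f"{cosine_distance_pct(reference, converted):.4f}% off")
```

A single scalar like this collapses the whole latent into one angle, which is why it cannot say where in the image the drift lands or whether it is visible after decoding.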