Where Core ML SD1.5 conversion loses fidelity

The energy post left one thread hanging. Running the SD1.5 UNet on the Neural Engine buys 6-7x lower energy at the same speed as GPU/MPS, but the SPLIT_EINSUM_V2 fp16 conversion lands 0.31% off an fp32 reference in cosine terms — “a small numerical catch.” That number is a tensor distance. It says the latents drift; it does not say whether a human looking at the decoded image would ever notice, or which of the three converted components is to blame.

This post decodes to pixels and answers both. The short version up front: the UNet conversion owns essentially all of the image-space loss (LPIPS 0.253), the VAE and text-encoder conversions are near-lossless (LPIPS 0.003 and 0.038), and the errors do not compound — the full converted pipeline (0.251) is within noise of the UNet alone. The harness is the same one as before, coreml-diffusion-benchmarks; every number below comes out of it.

Context

A 30-step image runs the UNet 30 times and the VAE decoder and text encoder once each. The energy post measured the UNet step in isolation — one tensor in, one tensor out — and reported a per-step cosine of 0.997 against fp32. That is the right unit for energy, but it is the wrong unit for “does the picture change”: per-step latent MSE and cosine tell you the tensors drift, not whether the drift survives 30 steps and a VAE decode into something a person can see.
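To make the unit concrete, here is a minimal sketch of the per-step tensor comparison. It assumes the 0.31% figure is 1 minus the cosine, expressed as a percentage, which is consistent with the 0.997 quoted above. The vectors are toy stand-ins for flattened latents, not harness data, and the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_gap_percent(a, b):
    """Distance from a perfect cosine of 1.0, as a percentage."""
    return 100.0 * (1.0 - cosine_similarity(a, b))

# Toy stand-ins for a flattened fp32 latent and a slightly perturbed copy.
reference = [0.5, -1.25, 2.0, 0.75]
converted = [0.52, -1.20, 1.95, 0.80]

print(f"cosine = {cosine_similarity(reference, converted):.4f}")
print(f"gap    = {cosine_gap_percent(reference, converted):.2f}%")
```

A number like this is computed on one step's output tensor. It says nothing about how the gap accumulates over 30 steps or what the VAE decoder does with it, which is why the rest of the post measures in image space.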

So the open question is an attribution one. Convert SD1.5 for the ANE and the full pipeline does diverge from fp32 in image space — that part is not in doubt. But three components get converted (UNet, VAE decoder, text encoder), and the benchmark cannot say which one carries the loss. The way to find out is to convert them one at a time.

Method

The design is one-at-a-time (OAT) attribution. Start from an all-fp32 diffusers pipeline — that is the reference, and the target every other config is measured against. Then swap exactly one Core ML component into it at a time — coreml-unet, coreml-vae, coreml-clip (the text encoder) — and finally the endpoint with all three converted at once (coreml-full). Five configs including the reference.

Swapping one component isolates blame, and the coreml-full endpoint tests whether the per-component errors compound when stacked. The swap is clean because the converted components are drop-in: CoreMLUNet, CoreMLVAE, and CoreMLTextEncoder satisfy the diffusers component contracts, so for the VAE-only and CLIP-only configs the UNet stays on fp32 torch and only the one component under test runs on Core ML. Components served from Core ML execute on CPU_AND_NE — the Neural Engine.

The inputs are fixed across every config: 10 committed prompts × seeds {0, 1, 2} = 30 samples per config, 30 steps, guidance 7.5, 512×512, and the same checkpoint SHA across all configs (recorded in provenance.json). Each generated image is compared to the fp32 reference image for the same prompt and seed on four metrics:

LPIPS — perceptual distance, lower is closer. The headline metric.
SSIM / PSNR — structural similarity and peak SNR, higher is closer.
CLIP score — on-promptness; does the image still match the text it was asked for. This is the one that separates “different sample” from “wrong picture.” (Note the name overlap: the coreml-clip config is the text-encoder swap; the CLIP score is an image↔text metric applied to every config.)
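The fixed input grid described above is small enough to enumerate. The sketch below uses placeholder prompt IDs in place of the 10 committed prompt strings; the helper names are hypothetical and not taken from the harness.

```python
from itertools import product

# Placeholder prompt IDs; the real harness uses 10 committed prompt strings.
PROMPT_IDS = list(range(10))
SEEDS = (0, 1, 2)
CONVERTED_CONFIGS = ("coreml-unet", "coreml-vae", "coreml-clip", "coreml-full")

def sample_grid(prompt_ids=PROMPT_IDS, seeds=SEEDS):
    """Every (prompt, seed) pair generated for a single config."""
    return list(product(prompt_ids, seeds))

def comparison_pairs(configs=CONVERTED_CONFIGS):
    """Every (config, prompt, seed) image that gets scored against the reference."""
    return [(config, p, s) for config in configs for p, s in sample_grid()]

print(len(sample_grid()))       # 30
print(len(comparison_pairs()))  # 120
```

Each converted image is paired with the reference image that shares its prompt and seed, so every metric is a paired comparison and not a comparison of two independent sample sets.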

The metric backbones are pinned so the numbers reproduce: LPIPS uses torchmetrics 1.6.1 LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True), and the CLIP score uses CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").
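Of the four metrics, PSNR is the one simple enough to write out by hand. The sketch below is the textbook definition on flat pixel sequences, assuming an 8-bit value range. It is not the harness's implementation, which is not shown here.

```python
import math

def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        # Identical images: zero error, so the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

ref = [10, 200, 34, 90]
print(psnr(ref, ref))                # inf
print(psnr(ref, [11, 199, 35, 89]))  # about 48.13 (every pixel off by one)
```

Higher is closer, and a self-comparison is infinite. That is why a reference-versus-reference row carries no information on the distance metrics.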

The converted artifacts are the ct9 toolchain (coremltools 9.0); the UNet uses SPLIT_EINSUM_V2 attention at fp16, the same conversion measured in the energy post.

The ladder itself, lifted verbatim from ablation_e2e.py — reference is first because it is the comparison target, not just another row:

# ablation_e2e.py — the OAT ladder + endpoint. `reference` MUST be first.
LADDER = [
    SwapConfig("reference",   unet=False, vae=False, text_encoder=False),
    SwapConfig("coreml-unet", unet=True,  vae=False, text_encoder=False),
    SwapConfig("coreml-vae",  unet=False, vae=True,  text_encoder=False),
    SwapConfig("coreml-clip", unet=False, vae=False, text_encoder=True),
    SwapConfig("coreml-full", unet=True,  vae=True,  text_encoder=True),
]
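The ladder only isolates blame if each middle row swaps exactly one component. A minimal sketch of that invariant follows. The SwapConfig dataclass here is a hypothetical stand-in with the same field names as the listing above; the real class in ablation_e2e.py is not shown in this post and may differ.

```python
from dataclasses import dataclass

COMPONENTS = ("unet", "vae", "text_encoder")

@dataclass(frozen=True)
class SwapConfig:
    """Hypothetical stand-in: True means the component is served from Core ML."""
    name: str
    unet: bool
    vae: bool
    text_encoder: bool

    def swapped(self):
        """Names of the components that run on Core ML in this config."""
        return [c for c in COMPONENTS if getattr(self, c)]

LADDER = [
    SwapConfig("reference",   unet=False, vae=False, text_encoder=False),
    SwapConfig("coreml-unet", unet=True,  vae=False, text_encoder=False),
    SwapConfig("coreml-vae",  unet=False, vae=True,  text_encoder=False),
    SwapConfig("coreml-clip", unet=False, vae=False, text_encoder=True),
    SwapConfig("coreml-full", unet=True,  vae=True,  text_encoder=True),
]

def check_ladder(ladder):
    """Raise ValueError unless the ladder is reference, single swaps, then endpoint."""
    reference, *single_swaps, endpoint = ladder
    if reference.swapped():
        raise ValueError("first config must be the all-fp32 reference")
    if any(len(cfg.swapped()) != 1 for cfg in single_swaps):
        raise ValueError("middle configs must swap exactly one component")
    covered = {cfg.swapped()[0] for cfg in single_swaps}
    if covered != set(COMPONENTS) or len(single_swaps) != len(COMPONENTS):
        raise ValueError("each component must be swapped exactly once")
    if set(endpoint.swapped()) != set(COMPONENTS):
        raise ValueError("last config must swap every component")
    return True

print(check_ladder(LADDER))  # True
```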

Numbers

Each cell is median [p10–p90] over the 30 samples per config, with every sample compared to the fp32 reference image for the matching prompt and seed. The values are measured and come from runs/ablation-ct9-public/summary.md, the public coreml-diffusion 0.1.3 rebuild, which reproduced the original run bit-exact across all 120 samples (4 converted configs × 30), so the table and the [reproduce] panel below point at the same data. The table reads in one line: coreml-unet looks like coreml-full, and coreml-vae / coreml-clip barely move.

config	LPIPS ↓	SSIM ↑	PSNR (dB) ↑	CLIP score ↑
coreml-unet	0.253 [0.141–0.446]	0.725 [0.495–0.878]	18.403 [14.051–23.414]	35.112 [30.661–38.546]
coreml-vae	0.003 [0.003–0.005]	0.997 [0.996–0.998]	46.096 [41.676–51.067]	35.229 [30.588–38.601]
coreml-clip	0.038 [0.004–0.176]	0.966 [0.780–0.995]	28.791 [18.639–41.445]	35.338 [30.487–38.253]
coreml-full	0.251 [0.142–0.445]	0.713 [0.507–0.873]	18.389 [14.063–23.503]	35.244 [30.392–38.801]
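Each cell in the table above has the shape median [p10–p90]. A sketch of that summary using only the standard library is below. The interpolation method ("inclusive") is an assumption, since the harness's percentile convention is not stated here, and the scores are synthetic.

```python
import statistics

def summarize(values):
    """Return (median, p10, p90) for a list of per-sample scores."""
    # quantiles(n=10) yields the 9 decile cut points; first is p10, last is p90.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return statistics.median(values), deciles[0], deciles[-1]

def format_cell(values, digits=3):
    """Format scores the way the table does: median [p10–p90]."""
    med, p10, p90 = summarize(values)
    return f"{med:.{digits}f} [{p10:.{digits}f}–{p90:.{digits}f}]"

# Synthetic scores for illustration only, not harness output.
scores = [i / 100 for i in range(1, 31)]
print(format_cell(scores))
```

With 30 samples, p10 and p90 each sit about three samples in from the extremes. The bracket therefore hides the single worst case, which matters for the text-encoder tail.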

The reference row is omitted because every distance metric there is perfect by construction — self-distance is LPIPS 0, SSIM 1, PSNR ∞. CLIP score is the exception: it is image↔text, not a self-distance, so the reference has a real finite value of 35.2 (median over the same 30 images) — the anchor the CLIP read below uses to show the converted configs match the reference, not merely each other. Three reads, in order of how load-bearing they are:

The first read is the headline: the UNet conversion owns the loss, and stacking all three conversions does not compound it. coreml-unet sits at LPIPS 0.253 and coreml-full at 0.251, within noise of each other. The other two reads support it.

The VAE conversion sits at the comparison’s own noise floor: LPIPS 0.003, below the 0.012 paired gap between coreml-full and coreml-unet, with SSIM 0.997 and PSNR 46. That is a decoded image you cannot tell from the fp32 one.

And CLIP score holds in a tight 35.1–35.3 band across all four converted configs. The fp32 reference itself scores 35.2, so the converted pipelines match its on-promptness, not merely each other’s. With the UNet converted, the pictures move (LPIPS 0.25 is a visible difference) but they do not move off-prompt: the divergence is sample-level, a different valid image for the same prompt, not semantic drift toward the wrong thing.
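The 0.012 paired gap is a per-sample quantity, not a difference of medians. One plausible definition is the median absolute difference over matched (prompt, seed) pairs, sketched below. This is an assumption about how such a gap could be computed, not the harness's stated formula, and the scores are synthetic.

```python
import statistics

def paired_gap(scores_a, scores_b):
    """Median absolute per-sample difference between two configs.

    Both arguments map (prompt_id, seed) to a score such as LPIPS.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("configs must cover the same (prompt, seed) samples")
    return statistics.median(abs(scores_a[k] - scores_b[k]) for k in scores_a)

# Synthetic scores for illustration only.
unet = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125}
full = {(0, 0): 0.5, (0, 1): 0.3125, (1, 0): 0.25}

print(paired_gap(unet, full))  # 0.0625
```

Two configs can have nearly identical medians while individual samples still differ by more. The paired gap is therefore the fairer yardstick for what counts as noise in this comparison.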

Figure: prompt 0 (astronaut), seed 0. The fp32 reference and each converted config (coreml-unet, coreml-vae, coreml-clip, coreml-full) side by side. This frame is the argmax for both coreml-unet (LPIPS 0.541) and coreml-clip (0.404): each is at its single worst sample here. coreml-unet and coreml-full visibly differ from the reference and track each other; coreml-vae is visually indistinguishable from it.

Figure: a typical sample (prompt 4, "a cute corgi wearing sunglasses", seed 0). coreml-unet LPIPS is 0.252, right at the distribution's median. This is what the headline 0.25 looks like: the coreml-unet and coreml-full columns shift detail but stay on-subject, and coreml-vae is near-identical to the reference. The 0.25 median summarizes a distribution; it is not a constant tax.

What broke

The text encoder is the one that does not behave like its median. Its median LPIPS is 0.038, but the p90 is 0.176 and the single worst sample reaches 0.404.

That is why the astronaut grid above is the hero image: prompt 0 seed 0 is the argmax for both the UNet (LPIPS 0.541) and coreml-clip (0.404), so the one frame carries the whole distribution’s tail. The likely mechanism is a dtype edge in the text-encoder path — an fp16 embedding feeding fp32-sensitive downstream math — the same class of issue as the upstream CoreMLTextEncoder output-dtype fix; the ablation does not isolate it, so this is the hypothesis, not a result.
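For a feel of what an fp16 edge means numerically, the sketch below round-trips a few values through IEEE 754 half precision using the standard library. It illustrates only the general rounding behavior. It does not reproduce the text-encoder path, and it is not evidence for the hypothesis above.

```python
import struct

def to_fp16_and_back(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def relative_error(x):
    """Relative rounding error introduced by storing x as fp16."""
    return abs(to_fp16_and_back(x) - x) / abs(x)

# Arbitrary illustrative values, not taken from any embedding.
for value in (0.1, 1.0, 3.14159, 1000.1):
    rounded = to_fp16_and_back(value)
    print(f"{value:>10} -> {rounded:<12} rel. error {relative_error(value):.2e}")
```

In the normal range, fp16 keeps roughly three significant decimal digits, with relative error up to 2^-11. Whether an error of that size matters depends on what the downstream math does with it, which is the part the ablation does not isolate.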

Takeaway

If you are converting SD1.5 for the ANE and you care about image quality, spend the worry on the UNet. The VAE conversion is genuinely free (LPIPS 0.003, at the noise floor); the text-encoder conversion is near-lossless in the median (0.038) but carries an occasional tail (p90 0.176, worst 0.404), so budget for the rare outlier rather than treating it as free. Either way the errors do not compound — you can convert the components independently and predict the full-pipeline hit from the UNet alone.

And the divergence is the right kind. CLIP score holds at ~35 everywhere: the converted pipeline does not drift off-prompt, it produces a different valid sample of the same prompt. Pair that with the energy post’s 6-7x and the tradeoff is honest — you are buying a large energy win for a sample that is different, not off-prompt.

One thread left open, stated as a question and not an answer: is the UNet hit SPLIT_EINSUM_V2 specifically — the attention variant measured throughout here — or fp16 in general? ORIGINAL vs SPLIT_EINSUM_V2 on the same UNet is the next ablation.
