The energy post left one thread hanging. Running the SD1.5 UNet on
the Neural Engine buys 6-7x lower energy at the same speed as GPU/MPS, but the
SPLIT_EINSUM_V2 fp16 conversion lands 0.31% off an fp32 reference in cosine
terms — “a small numerical catch.” That number is a tensor distance. It says the
latents drift; it does not say whether a human looking at the decoded image would
ever notice, or which of the three converted components is to blame.
This post decodes to pixels and answers both. The short version up front: the
UNet conversion owns essentially all of the image-space loss (LPIPS 0.253), the
VAE and text-encoder conversions are near-lossless (LPIPS 0.003 and 0.038), and
the errors do not compound — the full converted pipeline (0.251) is within noise
of the UNet alone. The harness is the same one as before,
coreml-diffusion-benchmarks; every number below comes out of it.
Context
A 30-step image runs the UNet 30 times and the VAE decoder and text encoder once each. The energy post measured the UNet step in isolation — one tensor in, one tensor out — and reported a per-step cosine of 0.997 against fp32. That is the right unit for energy, but it is the wrong unit for “does the picture change”: per-step latent MSE and cosine tell you the tensors drift, not whether the drift survives 30 steps and a VAE decode into something a person can see.
So the open question is an attribution one. Convert SD1.5 for the ANE and the full pipeline does diverge from fp32 in image space — that part is not in doubt. But three components get converted (UNet, VAE decoder, text encoder), and the benchmark cannot say which one carries the loss. The way to find out is to convert them one at a time.
Method
The design is one-at-a-time (OAT) attribution. Start from an all-fp32 diffusers
pipeline — that is the reference, and the target every other config is measured
against. Then swap exactly one Core ML component into it at a time —
coreml-unet, coreml-vae, coreml-clip (the text encoder) — and finally the
endpoint with all three converted at once (coreml-full). Five configs including
the reference.
Swapping one component isolates blame, and the coreml-full endpoint tests
whether the per-component errors compound when stacked. The swap is clean
because the converted components are drop-in: CoreMLUNet, CoreMLVAE, and
CoreMLTextEncoder satisfy the diffusers component contracts, so for the
VAE-only and CLIP-only configs the UNet stays on fp32 torch and only the one
component under test runs on Core ML. Components served from Core ML execute on
CPU_AND_NE — the Neural Engine.
The inputs are fixed across every config: 10 committed prompts × seeds {0, 1, 2} =
30 samples per config, 30 steps, guidance 7.5, 512×512, the same checkpoint SHA
across all (recorded in provenance.json). Each generated image is compared to
the fp32 reference image for the same prompt and seed on four metrics:
- LPIPS — perceptual distance, lower is closer. The headline metric.
- SSIM / PSNR — structural similarity and peak SNR, higher is closer.
- CLIP score — on-promptness; does the image still match the text it was asked
for. This is the one that separates “different sample” from “wrong picture.”
(Note the name overlap: the
coreml-clipconfig is the text-encoder swap; the CLIP score is an image↔text metric applied to every config.)
The metric backbones are pinned so the numbers reproduce: LPIPS uses torchmetrics
1.6.1 LearnedPerceptualImagePatchSimilarity(net_type“vgg”, normalize=True)=, and
the CLIP score uses CLIPScore(model_name_or_path“openai/clip-vit-base-patch16”)=.
The converted artifacts are the ct9 toolchain (coremltools 9.0); the UNet uses
SPLIT_EINSUM_V2 attention at fp16, the same conversion measured in the energy
post.
The ladder itself, lifted verbatim from ablation_e2e.py — reference is first
because it is the comparison target, not just another row:
| |
Numbers
Median [p10-p90] over the 30 samples per config, each compared to the fp32
reference image for the matching prompt and seed. These are measured, from
runs/ablation-ct9-public/summary.md — the public coreml-diffusion 0.1.3
rebuild, which reproduced the original run bit-exact across all 120 samples, so the
table and the [reproduce] panel below point at the same data. The table reads in
one line: coreml-unet looks like coreml-full, and coreml-vae / coreml-clip
barely move.
| config | LPIPS ↓ | SSIM ↑ | PSNR ↑ | CLIP ↑ |
|---|---|---|---|---|
| coreml-unet | 0.253 [0.141–0.446] | 0.725 [0.495–0.878] | 18.403 [14.051–23.414] | 35.112 [30.661–38.546] |
| coreml-vae | 0.003 [0.003–0.005] | 0.997 [0.996–0.998] | 46.096 [41.676–51.067] | 35.229 [30.588–38.601] |
| coreml-clip | 0.038 [0.004–0.176] | 0.966 [0.780–0.995] | 28.791 [18.639–41.445] | 35.338 [30.487–38.253] |
| coreml-full | 0.251 [0.142–0.445] | 0.713 [0.507–0.873] | 18.389 [14.063–23.503] | 35.244 [30.392–38.801] |
The reference row is omitted because every distance metric there is perfect by
construction — self-distance is LPIPS 0, SSIM 1, PSNR ∞. CLIP score is the
exception: it is image↔text, not a self-distance, so the reference has a real
finite value of 35.2 (median over the same 30 images) — the anchor the CLIP read
below uses to show the converted configs match the reference, not merely each
other. Three reads, in order of how load-bearing they are:
The other two reads support it. The VAE conversion sits at the comparison’s own
noise floor: LPIPS 0.003 — below the 0.012 paired gap between coreml-full and
coreml-unet — with SSIM 0.997 and PSNR 46, a decoded image you cannot tell from
the fp32 one. And CLIP score holds in a tight 35.1–35.3 band across all four
converted configs — the fp32 reference itself scores 35.2 (median over the same
30 images), so the converted pipelines match its on-promptness, not merely each
other’s. With the UNet converted: the pictures move (LPIPS 0.25 is a visible
difference) but they do not move off-prompt — the divergence is sample-level, a
different valid image for the same prompt, not semantic drift toward the wrong
thing.


What broke
The text encoder is the one that does not behave like its median.
That is why the astronaut grid above is the hero image: prompt 0 seed 0 is the
argmax for both the UNet (LPIPS 0.541) and coreml-clip (0.404), so the one
frame carries the whole distribution’s tail. The likely mechanism is a dtype edge
in the text-encoder path — an fp16 embedding feeding fp32-sensitive downstream
math — the same class of issue as the upstream CoreMLTextEncoder output-dtype
fix; the ablation does not isolate it, so this is the hypothesis, not a result.
Takeaway
If you are converting SD1.5 for the ANE and you care about image quality, spend the worry on the UNet. The VAE conversion is genuinely free (LPIPS 0.003, at the noise floor); the text-encoder conversion is near-lossless in the median (0.038) but carries an occasional tail (p90 0.176, worst 0.404), so budget for the rare outlier rather than treating it as free. Either way the errors do not compound — you can convert the components independently and predict the full-pipeline hit from the UNet alone.
And the divergence is the right kind. CLIP score holds at ~35 everywhere: the converted pipeline does not drift off-prompt, it produces a different valid sample of the same prompt. Pair that with the energy post’s 6-7x and the tradeoff is honest — you are buying a large energy win for a sample that is different, not off-prompt.
One thread left open, stated as a question and not an answer: is the UNet hit
SPLIT_EINSUM_V2 specifically — the attention variant measured throughout here —
or fp16 in general? ORIGINAL vs SPLIT_EINSUM_V2 on the same UNet is the next
ablation.