I am building a macOS menu-bar app that listens to live disk I/O and plays back the mechanical chatter of a 1990s hard drive in response. The synthesis engine is the whole point, and my first one was wrong in a way that took a while to hear: it modelled the wrong motor. This is the log of getting from that first engine to one that actually clicks.

The target was a single reference recording — a clean capture of an old PC drive seeking and chattering.

First engine: the wrong motor

The intuitive picture of an old hard drive is a stepper motor: discrete steps, a buzz whose pitch tracks the step rate. So the first synth was exactly that — a train of per-step impulses, the fundamental pitch following the commanded step rate, harmonics from the near-square torque pulse. It sounds mechanical. It even sounds old. It just doesn’t sound like the thing I was chasing.

First engine — stepper model, short (track-to-track) move.
First engine — stepper model, full-stroke move. Note the rising-then-falling buzz.

That buzz is real, but it belongs to a different class of drive. Which is the part I had backwards.

The reversal: it was never a stepper

When I went back to the source material instead of my assumptions, the picture inverted.

There is a real evidence gap here, and it is worth stating plainly: there is no published instrumented acoustic spectrum of the specific vintage parts. The model that follows is extrapolated from documented OEM seek timings, voice-coil-motor patents describing the noise mechanism, and the reference recording — and where it is extrapolated, it is flagged as such. This is physically-motivated synthesis, not a reconstruction of a measured drive.

The first thing the reference clip got me was structure. Slicing it into individual click events and projecting them down shows the clicks are not one sound — they fall into families (a darker, longer body and a brighter, sharper attack), which is consistent with a two-transient seek whose two current edges vary in relative weight as the seek distance changes.

PCA scatter of reference clicks separating into a dark and a bright family
Clicks extracted from the reference clip, PCA-projected. Left: k-means clusters (a dark family and a bright family). Right: the same points colored by the source segment they came from.

Rebuilding from voice-coil physics

The rewrite replaced the stepper synth with a voice-coil seek voice. Three documented mechanisms, layered per seek:

  1. Bang-bang seek current → two transients. The voice-coil current during a conventional seek is essentially a square wave: a positive pulse accelerates the head, a negative pulse brakes it. Each current edge is a broadband impulse that excites the actuator. Seagate patent 6,937,428 puts it directly — the “abrupt application of current… to quickly accelerate and decelerate” gives “broad spectrum excitation of the actuator” — and the earlier 5,760,992 describes the same seek noise in its own terms, the actuator “blow” setting up “resonant vibrations in the actuator… to produce readily audible noise.” For a short seek the two edges merge into one click; for a long seek a coast phase separates them into a distinct launch and arrival.
  2. Crash-stop impact. The actuator’s travel limits are elastomer bumpers; the head arriving against one produces the end-of-stroke “thunk.”
  3. Settle ring. The abrupt arrival excites lightly-damped arm resonances (~0.8–3.8 kHz) that ring down after the head lands.

In the lab the voice is a single function. Trimmed for readability, the core is:

def render_vca_seek(p: VCAParams) -> AudioBuffer:
    # Stroke duration scales with seek distance: track-to-track .. full stroke.
    seek_ms = p.min_seek_ms + (p.max_seek_ms - p.min_seek_ms) * p.seek_distance

    # 1. Bang-bang current. The accel/decel EDGES inject the most energy, so the
    #    jerk (|d force|) drives a band-limited noise burst, not the force itself.
    force = _seek_envelope(seek_n, p.coast_frac)          # accel ramp / coast / decel ramp
    jerk = np.abs(np.diff(force, prepend=force[:1]))
    jerk_env = force * 0.4 + (jerk / jerk.max()) * 0.6
    current = _bandpass(rng.standard_normal(seek_n), sr, *p.current_band_hz)
    out[seek] += current * jerk_env * p.current_gain

    # 2. Crash-stop thunk: a short band-limited noise burst at arrival.
    out[arrival] += _bandpass(decaying_noise(p.crash_ms), sr, *p.crash_band_hz) * p.crash_gain

    # 3. Settle ring: damped sinusoids at the arm-resonance modes.
    for freq, gain in zip(p.settle_modes_hz, p.settle_gains):
        ring += np.sin(2*np.pi*freq*t) * np.exp(-t / tau) * gain
    out[arrival:] += ring

The free parameters — the three resonance modes, their gains, the decay, the crash level, the current band — are not hand-tuned. An autofit harness fits them to the cleaned reference set (61 click clips) against two objectives: eight scalar landmark features (attack time, ring-down, spectral centroid, mid- and high-band energy ratios, ring peakiness, and the two arm-resonance peaks), each scored against a [p25, p75] tolerance band taken from the reference set, plus a masked log-band spectral cosine over the bands that sit above the recording’s noise floor. The fit is the part I trust the most, because it has a number on it.

Log-band spectrum of the VCA baseline synth overlaid on the real reference, residual 0.087
Frozen VCA baseline (blue) against the median of the 61 reference clips (black), over the log-frequency bands above the recording's noise floor. The three synth peaks are the fitted arm-resonance modes (960 / 2641 / 3756 Hz); the masked spectral residual is ≈ 0.087.

The fitted timbre is frozen as a preset; the runtime drives only two knobs on top of it (pitch, damping) plus the per-seek distance.

Hearing the difference

This is the rewrite that justified itself. Play the voice-coil voice against the two stepper clips above — same two moves, different machine.

A long move: where the stepper buzzed, the voice-coil splits into launch and arrival.

Voice-coil, full stroke — two separated transients.

A short move: where the stepper still buzzed, the voice-coil collapses to a single tight click.

Voice-coil, short seek — one click.

And a burst of random-access seeks — the kind of texture the app produces under load:

Voice-coil chatter, a train of random-distance seeks.

The cabinet matters too

A bare seek voice still sounds synthetic, because much of what you hear from a real drive is the box. A 3.5" HDD case is far too small to reverberate — its Schroeder frequency sits near the top of the audible band (≈ 20 kHz for a box that size), so there is effectively no diffuse-field regime inside the audio range. It cannot be modelled as a reverb. Instead it is treated in the modal regime: a bank of rigid rectangular-cavity eigenmodes (the resonant body) plus image-source early reflections (the metallic attack ring). The two describe the same physical field, so they are exposed as independent amounts rather than summed at full strength, which would double-count the energy.

A dry seek, then the same seek through the two layers and the blend, in a 3.5" cavity:

3.5" — dry seek, no enclosure.
3.5" — modal only (resonant body).
3.5" — image only (early-reflection ring).
3.5" — hybrid blend.

The cavity geometry is a parameter. A 2.5" laptop-drive enclosure is a smaller box, so its modes sit higher and tighter — the same seek reads as a harder, drier click:

2.5" — dry seek.
2.5" — hybrid blend in the smaller cavity.

And the difference is clearest on a chatter burst, where the cabinet glues the individual clicks into one continuous body:

3.5" — chatter, dry.
3.5" — chatter, enclosed.
2.5" — chatter, enclosed (smaller box).

From disk I/O to clicks

A real seek voice still needs something to seek to. The app’s input is `fs_usage` disk-I/O events, and the mapping between an I/O event and a sound went through its own correction. The first version filtered and rate-limited the event stream — and that averaged out the per-event arrival timing, which is exactly the information that carries the drive’s rhythm. Removing the filtering and sonifying events closer to 1:1 brought the rhythm back; the current version coalesces them into activity-driven chatter so a continuous backend produces continuous sound without flooding the UI.

One host I/O does not map to one seek physically, either. Each operation is expanded into a cluster of seeks — a read pulls directory, allocation table, and data regions; a write coalesces — with the seek distances drawn from a measured distribution.

Takeaway

The first engine was not badly tuned; it was the wrong model. No amount of parameter tweaking on a stepper synth was going to produce a voice-coil click, because the stepper and the voice-coil are different machines making different noises. The fix was not in the synth — it was going back to the documented physics of the part that actually makes the sound, fitting the free parameters to a real recording instead of by ear, and putting a number on how close the fit got (≈ 0.087 spectral residual). The cabinet and the I/O-to-seek mapping are the same story in miniature: get the physical model right first, tune second.