UPTF / Unified Pitch-Time Field

One operator. Five effects. Smooth morphing between any two impulse responses.

A representation for digital audio in which time-stretch, pitch-shift, convolution, modal resynthesis, and cross-synthesis are five evaluations of one parametric operator. The renderer is differentiable end-to-end. Morphing between impulse responses is continuous in the morph parameter by construction. Mathematically verified in Lean 4. Empirically demonstrated on real material.

The wider claim: FFT-based DSP is the wrong abstraction

Partitioned-FFT convolution is the static-IR state of the art and has been since the early 2000s. It is fast and accurate, and its latency is bounded. It is also, for an entire class of modern audio problems, the wrong representation. The FFT view of audio has four built-in assumptions:

Each assumption rules out a class of effects that audio engineers reach for daily, only to find that the standard toolchain cannot deliver them: a moving listener in a 3D space, a smooth blend between instrument bodies, a hybrid of two rooms, a parametric reverb whose shape is the output of a neural net, a convolver that knows its own gradient. UPTF replaces the assumptions, not the algorithm.

What FFT-based convolution cannot do

| Capability | Partitioned FFT | UPTF |
|---|---|---|
| Static IR convolution, sub-millisecond latency | yes (24× faster than direct) | matches in production, slower in reference |
| Sample-accurate reproduction of an IR | to float precision | bit-exact when head ≥ IR (verified) |
| Smooth, artefact-free morphing between two IRs | no; partition swaps produce zipper noise | L2-continuous by construction |
| Differentiable in the IR parameter | discrete partition tables; no gradient | end-to-end via the renderer |
| Inverse rendering (fit an IR to a target sound) | not in the representation | native; encoder learns it |
| Unification with other effects (time-stretch, pitch-shift) | separate algorithms; partial overlap at best | five effects as one parametric operator (proved) |
| Parameter-space audio editing (knob = move in object graph) | not in the representation | native; the graph is the editing surface |

The race for the static-IR problem is over. Partitioned FFT wins. The race for everything else — morphing, learning, parameter editing, cross-synthesis, hybrid effects — is open. UPTF is built for that race.
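For readers who have not met the static-IR baseline, the following is a minimal NumPy sketch of uniformly partitioned overlap-add convolution. It is a textbook illustration, not the UPTF engine and not any particular production convolver; the function name `partitioned_fft_convolve` and the block size are choices made for this example only. It shows where the static "partition table" lives: the IR is cut into blocks and transformed once, which is why changing the IR means swapping table entries.

```python
import numpy as np

def partitioned_fft_convolve(x, h, block=64):
    """Uniformly partitioned overlap-add convolution (illustrative, not optimised)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    n_fft = 2 * block
    n_parts = -(-len(h) // block)    # ceil(len(h) / block)
    n_blocks = -(-len(x) // block)   # ceil(len(x) / block)

    # Pre-transform each IR partition once: this is the static partition table.
    h_pad = np.zeros(n_parts * block)
    h_pad[:len(h)] = h
    H = np.fft.rfft(h_pad.reshape(n_parts, block), n=n_fft, axis=1)

    # Trailing zero blocks flush the IR tail after the input ends.
    x_pad = np.zeros((n_blocks + n_parts - 1) * block)
    x_pad[:len(x)] = x

    fdl = np.zeros_like(H)           # frequency-domain delay line, newest block in row 0
    out = np.zeros((n_blocks + n_parts) * block)
    for b in range(n_blocks + n_parts - 1):
        fdl = np.roll(fdl, 1, axis=0)
        fdl[0] = np.fft.rfft(x_pad[b * block:(b + 1) * block], n=n_fft)
        # Partition p is multiplied with the input block that arrived p blocks ago.
        y = np.fft.irfft(np.sum(H * fdl, axis=0), n=n_fft)
        out[b * block:b * block + n_fft] += y   # overlap-add
    return out[:len(x) + len(h) - 1]

# Compare against direct convolution on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(300)
max_err = np.max(np.abs(partitioned_fft_convolve(x, h) - np.convolve(x, h)))
```

Each output block depends only on the current and earlier input blocks, so latency is bounded by the block size. The result agrees with direct convolution up to floating-point rounding.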

Evidence: head-length sweep on real cab IRs

[Figure: UPTF perceptual fidelity (CLAP cosine) and sample-domain L2 error as a function of FIR head length, log scale]
Perceptual fidelity (LAION-CLAP audio-embedding cosine to direct-convolution ground truth) and sample-domain error against FIR head length, on a Science 4x12 cab impulse response (~1 s). Three test signals: click (transient), sine sweep (harmonic), drum loop (mixed). At full head length the UPTF reconstruction is bit-exact and the CLAP cosine is 1.000, which is the formal prediction confirmed in practice. The practical conclusion the figure supports: adaptive head sizing is the production answer.
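The sample-domain half of such a sweep can be sketched in a few lines. This is an assumption-laden stand-in, not the evaluation script: the IR here is synthetic decaying noise rather than a real cab IR, the "head" is plain truncation with the tail dropped entirely (what UPTF does beyond the head is not described on this page), and the CLAP cosine is not computed. The names `relative_l2_error` and `head_length_sweep` are hypothetical.

```python
import numpy as np

def relative_l2_error(reference, estimate):
    """L2 norm of the difference, normalised by the L2 norm of the reference."""
    return np.linalg.norm(reference - estimate) / np.linalg.norm(reference)

def head_length_sweep(x, ir, head_lengths):
    """Error of convolving with a truncated IR head versus the full IR."""
    reference = np.convolve(x, ir)
    errors = {}
    for n in head_lengths:
        head = np.zeros_like(ir)
        head[:n] = ir[:n]            # keep the first n taps, zero the tail
        errors[n] = relative_l2_error(reference, np.convolve(x, head))
    return errors

# Synthetic stand-in for a cab IR: exponentially decaying noise, 0.25 s at 48 kHz.
rng = np.random.default_rng(0)
sr = 48_000
t = np.arange(sr // 4) / sr
ir = rng.standard_normal(t.size) * np.exp(-t / 0.03)

click = np.zeros(2048)
click[0] = 1.0                       # transient test signal

sweep = head_length_sweep(click, ir, [256, 1024, 4096, ir.size])
```

In this simplified setting the error shrinks as the head grows and is exactly zero once the head covers the whole IR, which is the truncation analogue of the "bit-exact when head ≥ IR" row in the table.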

Evidence: smooth morphing trajectory

[Figure: RMS envelope of UPTF object-graph morph compared against partition-crossfade baseline, 6-second sweep through morph parameter]
RMS envelope (dB, 10 ms windows) of a 6-second morph from cab A to cab B using two methods: UPTF object-graph parameter interpolation (purple) and partition-crossfade baseline (brown). Same dry source. The UPTF trajectory glides through the timbral mid-points; the crossfade inherits the discrete swap pattern. The continuity property is formally proved (Lean theorem T4).
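The measurement used in the figure, an RMS envelope in dB over 10 ms windows, is easy to reproduce on any pair of renders. The sketch below assumes non-overlapping windows and uses a plain output-level crossfade between two static convolutions as a stand-in; the exact partition-crossfade baseline and the UPTF morph are not specified here, and the IRs are synthetic. `rms_envelope_db` and `output_crossfade` are names invented for this example.

```python
import numpy as np

def rms_envelope_db(signal, sample_rate, window_ms=10.0, floor_db=-120.0):
    """RMS level in dB over consecutive non-overlapping windows."""
    win = max(1, int(round(sample_rate * window_ms / 1000.0)))
    n_frames = len(signal) // win
    frames = np.asarray(signal[:n_frames * win], dtype=float).reshape(n_frames, win)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    floor = 10.0 ** (floor_db / 20.0)        # avoid log10(0) on silent windows
    return 20.0 * np.log10(np.maximum(rms, floor))

def output_crossfade(wet_a, wet_b, alpha):
    """Per-sample linear blend of two already-convolved signals."""
    return (1.0 - alpha) * wet_a + alpha * wet_b

# One second of noise through two synthetic IRs, morph parameter ramping 0 -> 1.
rng = np.random.default_rng(0)
sr = 48_000
dry = rng.standard_normal(sr)
decay = np.exp(-np.arange(256) / 40.0)
ir_a = rng.standard_normal(256) * decay
ir_b = rng.standard_normal(256) * decay

wet_a = np.convolve(dry, ir_a)[:len(dry)]
wet_b = np.convolve(dry, ir_b)[:len(dry)]
alpha = np.linspace(0.0, 1.0, len(dry))

env = rms_envelope_db(output_crossfade(wet_a, wet_b, alpha), sr)
```

Running the same `rms_envelope_db` on two morph renders of the same dry source gives directly comparable curves, which is the comparison the figure draws.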

Listen

Same dry source, same pair of impulse responses (Darker and Brighter mic positions of the same guitar cab), the morph parameter sweeps continuously from A to B across the file. Headphones recommended.

Dry source: pink noise + plucky impulses, 6 s
Cab A, static: Darker mic throughout
Cab B, static: Brighter mic throughout
Partition crossfade: industry baseline morph A→B
UPTF object-graph morph: parameter interpolation A→B

Differentiability is the real prize

At a 48 kHz sample rate, one second of sample-domain audio is a 48,000-dimensional vector. A learned model that operates on samples directly is fighting the curse of dimensionality. UPTF compresses an audio frame into a small, semantically meaningful parameter vector. The encoder is trained end-to-end: every layer is differentiable, every gradient flows. The renderer is differentiable in those parameters too. The round trip is C¹ (continuously differentiable).
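To make "a convolver that knows its own gradient" concrete, here is a minimal sketch of the simplest possible case: plain FIR convolution, with the gradient of a squared-error loss taken with respect to the IR taps and used to fit an IR to a target sound. It is not the UPTF renderer or encoder, whose parameterisation is unpublished; the signals are synthetic and `loss_and_grad` is a hypothetical helper. The gradient formula follows from y[n] = Σ_k h[k] x[n−k], so ∂loss/∂h[k] = Σ_n r[n] x[n−k], which is a cross-correlation of the residual with the input.

```python
import numpy as np

def loss_and_grad(h, x, target):
    """0.5 * ||x * h - target||^2 and its gradient with respect to the IR taps h."""
    residual = np.convolve(x, h) - target
    loss = 0.5 * np.dot(residual, residual)
    # grad[k] = sum_n residual[n + k] * x[n], one entry per IR tap
    grad = np.correlate(residual, x, mode="valid")
    return loss, grad

# Synthetic inverse-rendering problem: recover a 16-tap IR from input and output.
rng = np.random.default_rng(1)
x = rng.standard_normal(256)
h_true = rng.standard_normal(16) * np.exp(-np.arange(16) / 4.0)
target = np.convolve(x, h_true)

# Step size from the peak of the input power spectrum (a curvature estimate).
step = 1.0 / np.max(np.abs(np.fft.rfft(x, 4 * len(x)))) ** 2

h = np.zeros(16)
initial_loss, _ = loss_and_grad(h, x, target)
for _ in range(200):
    _, grad = loss_and_grad(h, x, target)
    h -= step * grad                 # plain gradient descent on the IR taps
final_loss, _ = loss_and_grad(h, x, target)
```

Here the parameter vector is the raw tap list, so the dimensionality problem described above still applies. The point of a compact parameterisation is to run the same kind of descent in a much smaller space.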

[Figure: Encoder training loss curve on real cab-convolution outputs, with pink-noise held out for evaluation]
Encoder + renderer trained end-to-end on real cab-convolution outputs. Training loss (object-structured signals) falls from 0.5 to 0.32 over 25 epochs. Pink-noise evaluation flatlines at ~1.09. This is not a bug but the predicted scope limit: the UPTF representation factors signals into sparse objects + texture, and a signal that is all texture has no sparse structure to factor out. Most musical content has object structure; synthesised noise does not.

The downstream consequences:

Status

Theoretical framework: 7 preprints, 2019–2026
Verification layer: Lean 4 / Mathlib, no sorry
T1 unification: closed
T2 differentiability: closed
T4 morph continuity: closed
T5 sample-accuracy: property-tested, 300+ random IRs
Engineering anchor: VST3 plugin, shipped
Empirical evaluation: 8 cab IRs × 5 signals + sweep
Listening test: scaffold ready; 20-listener run pending
Public deposit: forthcoming

What we are not publishing yet

The implementation details, the object representation, the analyser-renderer architecture, the encoder weights, and the production C++ engine source remain unpublished pending the camera-ready DAFx submission. Reviewers under NDA may request the full source.

What this is not

For the impatient

The mathematical claims are machine-checked against Mathlib. The audio examples above are not synthetic: they are produced by the reference implementation, with no post-processing. The plugin runs in real time on a laptop. None of the implementation is on GitHub yet.

A. Shivakumar · Independent · abhishek.shivakumar@gmail.com · for collaboration, licensing, NDA-gated source access, or review enquiries.