Latent spaces are learned coordinate systems where nearby points correspond to perceptually related audio outcomes.
Probabilistic Encoding
A VAE-style encoder maps an input x to an approximate posterior over the latent code z, a Gaussian whose mean and variance depend on x:
$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\ \sigma_\phi^2(x)\big)$$
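Sampling uses the reparameterization trick, which moves the randomness into a fixed noise variable so that z stays differentiable with respect to μ and σ: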
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
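
A minimal NumPy sketch of the sampling step above. The function name `reparameterize` and the choice to have the encoder output a log-variance (rather than σ directly) are assumptions made for illustration; the text does not specify either.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    # Assumed parameterization: the encoder outputs log(sigma^2),
    # so sigma = exp(0.5 * log_var) is always positive.
    sigma = np.exp(0.5 * log_var)
    # The noise is drawn independently of mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])        # illustrative encoder mean
log_var = np.array([-2.0, 0.0, -4.0])  # illustrative encoder log-variance
z = reparameterize(mu, log_var, rng)   # one latent sample, same shape as mu
```

In an autodiff framework the same expression lets gradients flow through `mu` and `sigma`, because `eps` is the only random quantity and it does not depend on the encoder parameters.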
Geometry and Musical Semantics
Well-trained latent spaces tend to show:
- Timbre neighborhoods (similar instrument tone clusters)
- Style manifolds (genre and production traits)
- Continuous controls (energy, density, brightness, tension)
These structures make interpolation and editing possible without explicit symbolic rules.
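
One common way to turn such a structure into an edit is to move a latent code along an attribute direction. The sketch below is an illustration, not a method stated in this text: it assumes two hypothetical sets of latents labeled "bright" and "dull", estimates a direction as the difference of their means, and uses synthetic data in place of real encoder outputs.

```python
import numpy as np

def attribute_direction(z_high, z_low):
    """Unit vector pointing from the mean low-attribute latent to the mean high-attribute latent."""
    d = z_high.mean(axis=0) - z_low.mean(axis=0)
    return d / np.linalg.norm(d)

def edit(z, direction, amount):
    """Shift a latent code by `amount` along a unit direction."""
    return z + amount * direction

rng = np.random.default_rng(0)
# Synthetic stand-ins for encoded clips; real latents would come from the encoder.
offset = np.zeros(8)
offset[0] = 2.0  # pretend "brightness" lives mostly in dimension 0
z_dull = rng.standard_normal((200, 8))
z_bright = rng.standard_normal((200, 8)) + offset

brightness = attribute_direction(z_bright, z_dull)
z_edited = edit(z_dull[0], brightness, amount=1.5)
```

Decoding `z_edited` would then be expected to give a brighter variant of the original clip, provided the attribute really is close to linear in the latent space.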
Interpolation
Linear path between two points:
$$z_t = (1 - t)\, z_A + t\, z_B, \qquad t \in [0, 1]$$
Spherical interpolation follows the great-circle arc between the endpoints, preserves the norm when both endpoints share it, and often sounds smoother:
$$\operatorname{slerp}(z_A, z_B; t) = \frac{\sin\big((1 - t)\theta\big)}{\sin\theta}\, z_A + \frac{\sin(t\theta)}{\sin\theta}\, z_B$$
where
$$\theta = \arccos\left(\frac{z_A \cdot z_B}{\lVert z_A \rVert\, \lVert z_B \rVert}\right)$$
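
The two interpolation schemes can be compared with a short NumPy sketch. The function names and the fallback to linear interpolation when sin θ is close to zero are assumptions of this sketch, added because the spherical formula divides by sin θ.

```python
import numpy as np

def lerp(z_a, z_b, t):
    """Linear interpolation: (1 - t) * z_a + t * z_b."""
    return (1.0 - t) * z_a + t * z_b

def slerp(z_a, z_b, t, eps=1e-8):
    """Spherical interpolation along the arc between z_a and z_b."""
    cos_theta = np.dot(z_a, z_b) / (np.linalg.norm(z_a) * np.linalg.norm(z_b))
    # Clip to guard against rounding slightly outside [-1, 1].
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < eps:
        # Nearly parallel or antiparallel vectors: the formula is
        # ill-conditioned, so fall back to lerp (a choice of this sketch).
        return lerp(z_a, z_b, t)
    w_a = np.sin((1.0 - t) * theta) / sin_theta
    w_b = np.sin(t * theta) / sin_theta
    return w_a * z_a + w_b * z_b

z_a = np.array([1.0, 0.0])
z_b = np.array([0.0, 1.0])
mid_lerp = lerp(z_a, z_b, 0.5)    # [0.5, 0.5], norm shrinks to about 0.707
mid_slerp = slerp(z_a, z_b, 0.5)  # about [0.707, 0.707], norm stays 1
```

For these unit-norm endpoints the linear midpoint falls inside the sphere while the spherical midpoint stays on it, which is one reason slerp is often preferred when latents concentrate near a shell of roughly constant norm.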
Training Objective (ELBO)
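$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$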
- Reconstruction term preserves musical detail
- KL term regularizes the space for stable sampling and interpolation
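
The two terms can be evaluated with a small NumPy sketch. It rests on assumptions the text does not state: a standard normal prior p(z) = N(0, I), a diagonal Gaussian posterior parameterized by a log-variance, and a unit-variance Gaussian decoder, so the reconstruction term is a squared error up to an additive constant. The function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo_estimate(x, x_recon, mu, log_var):
    """Single-sample ELBO estimate under the assumptions above.

    x_recon is the decoder output for one z drawn from the posterior.
    The log-likelihood omits the constant term of the Gaussian density.
    """
    log_lik = -0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return log_lik - kl_to_standard_normal(mu, log_var)

# A posterior equal to the prior has zero KL.
kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))
# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))

x = np.array([0.2, -0.1, 0.4])
x_recon = np.array([0.1, -0.1, 0.6])
elbo = elbo_estimate(x, x_recon, np.array([1.0, 0.0]), np.zeros(2))
```

Training would maximize this quantity, or equivalently minimize its negative, averaged over a batch.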