Latent spaces are learned coordinate systems where nearby points correspond to perceptually related audio outcomes.
Probabilistic Encodingā
A VAE-style encoder maps input x to a posterior distribution:
qĻā(zā£x)=N(μĻā(x),ĻĻ2ā(x))
Sampling uses reparameterization:
z=μ+Ļāϵ,ϵā¼N(0,I)
Geometry and Musical Semanticsā
Well-trained latent spaces tend to show:
- Timbre neighborhoods (similar instrument tone clusters)
- Style manifolds (genre and production traits)
- Continuous controls (energy, density, brightness, tension)
These structures make interpolation and editing possible without explicit symbolic rules.
Interpolationā
Linear path between two points:
ztā=(1āt)zAā+tzBā,tā[0,1]
Spherical interpolation preserves norm and often sounds smoother:
slerp(zAā,zBā;t)=sinĪøsin((1āt)Īø)āzAā+sinĪøsin(tĪø)āzBā
where
Īø=arccos(ā„zAāā„ā„zBāā„zAāā
zBāā)
Training Objective (ELBO)ā
LELBOā=EqĻā(zā£x)ā[logpĪøā(xā£z)]āDKLā(qĻā(zā£x)ā„p(z))
- Reconstruction term preserves musical detail
- KL term regularizes the space for stable sampling and interpolation