Latent Space Mapping

Latent spaces are learned coordinate systems where nearby points correspond to perceptually related audio outcomes.

Probabilistic Encoding

A VAE-style encoder maps an input $x$ to a posterior distribution over latent vectors:

$$q_\phi(\mathbf{z}|x)=\mathcal{N}(\boldsymbol{\mu}_\phi(x),\boldsymbol{\sigma}_\phi^2(x))$$

Sampling uses reparameterization:

$$\mathbf{z}=\boldsymbol{\mu}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$$
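
The sampling step above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the function name `reparameterize` is hypothetical, and it assumes the encoder outputs a log-variance per dimension (a common convention, but not stated in this section), so that $\sigma=\exp(\tfrac{1}{2}\log\sigma^2)$.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one dimension at a time.

    Assumption: the encoder emits log-variance, so sigma = exp(0.5 * log_var).
    """
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation for this dimension
        eps = rng.gauss(0.0, 1.0)    # noise drawn independently of mu and sigma
        z.append(m + sigma * eps)
    return z


# Illustrative values only; a real encoder would produce mu and log_var.
mu = [0.5, -1.0, 2.0]
log_var = [-2.0, 0.0, -20.0]
z = reparameterize(mu, log_var, random.Random(0))
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `log_var`, which is what lets gradients flow back to the encoder in an autodiff framework. A dimension with very small variance (the last one here) stays close to its mean.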

Geometry and Musical Semantics

Well-trained latent spaces tend to show:

  • Timbre neighborhoods (similar instrument tone clusters)
  • Style manifolds (genre and production traits)
  • Continuous controls (energy, density, brightness, tension)

These structures make interpolation and editing possible without explicit symbolic rules.

Interpolation

Linear path between two points:

$$\mathbf{z}_t=(1-t)\mathbf{z}_A+t\mathbf{z}_B,\quad t\in[0,1]$$

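Spherical interpolation (slerp) follows the arc between the two points rather than the chord. It keeps the norm constant when both endpoints have the same norm, and often sounds smoother: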

$$\text{slerp}(\mathbf{z}_A,\mathbf{z}_B;t)=\frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{z}_A+\frac{\sin(t\theta)}{\sin\theta}\mathbf{z}_B$$

where

$$\theta=\arccos\left(\frac{\mathbf{z}_A\cdot\mathbf{z}_B}{\|\mathbf{z}_A\|\|\mathbf{z}_B\|}\right)$$
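
Both interpolation formulas can be compared with a short sketch in plain Python. The helper names `lerp` and `slerp` and the `eps` tolerance are illustrative choices, not part of any particular library. The sketch assumes non-zero latent vectors and falls back to linear interpolation when the vectors are nearly parallel or anti-parallel, where $\sin\theta$ approaches zero.

```python
import math


def lerp(z_a, z_b, t):
    """Linear interpolation: (1 - t) * z_a + t * z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def slerp(z_a, z_b, t, eps=1e-8):
    """Spherical interpolation between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Angle is close to 0 or pi: the slerp weights are ill-defined.
        return lerp(z_a, z_b, t)
    w_a = math.sin((1.0 - t) * theta) / sin_theta
    w_b = math.sin(t * theta) / sin_theta
    return [w_a * a + w_b * b for a, b in zip(z_a, z_b)]


def norm(z):
    return math.sqrt(sum(v * v for v in z))


# Two orthogonal unit vectors as a toy example.
z_a = [1.0, 0.0]
z_b = [0.0, 1.0]
mid_lerp = lerp(z_a, z_b, 0.5)    # [0.5, 0.5], norm about 0.707
mid_slerp = slerp(z_a, z_b, 0.5)  # norm stays at about 1.0
```

The midpoint of the linear path has a smaller norm than either endpoint, while the slerp midpoint stays on the unit circle. That difference is one plausible reason linear paths can pass through regions the decoder rarely saw during training.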

Training Objective (ELBO)

$$\mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(\mathbf{z}|x)}[\log p_\theta(x|\mathbf{z})]-D_{\text{KL}}\left(q_\phi(\mathbf{z}|x)\,\|\,p(\mathbf{z})\right)$$

  • Reconstruction term preserves musical detail
  • KL term regularizes the space for stable sampling and interpolation
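
To make the two terms concrete, here is a small sketch of a single-sample ELBO estimate in plain Python. It rests on assumptions not stated in this section: a standard normal prior $p(\mathbf{z})=\mathcal{N}(0,\mathbf{I})$, a diagonal Gaussian posterior parameterized by log-variance, and a unit-variance Gaussian decoder whose mean is `x_hat`. All function names are hypothetical. Under these assumptions the KL term has the closed form $\tfrac{1}{2}\sum_i(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2)$.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def gaussian_log_likelihood(x, x_hat):
    """log p(x | z) for a unit-variance Gaussian decoder with mean x_hat."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)


def elbo_estimate(x, x_hat, mu, log_var):
    """Single-sample ELBO: reconstruction term minus KL term."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)


# Toy numbers: x_hat stands in for a decoder output computed from one sampled z.
x = [0.2, -0.1, 0.4]
x_hat = [0.25, -0.05, 0.3]
mu = [1.0, 0.0]
log_var = [0.0, 0.0]
value = elbo_estimate(x, x_hat, mu, log_var)
```

The KL term is zero only when the posterior matches the prior exactly, so it pulls every encoded input toward the same region of the space. In training code the negative of this quantity is typically minimized as the loss.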