Latent Space Mapping

Latent spaces are learned coordinate systems where nearby points correspond to perceptually related audio outcomes.

Probabilistic Encoding

A VAE-style encoder maps an input $x$ to a posterior distribution:

$$
q_\phi(\mathbf{z}|x)=\mathcal{N}(\boldsymbol{\mu}_\phi(x),\boldsymbol{\sigma}_\phi^2(x))
$$

Sampling uses reparameterization:

$$
\mathbf{z}=\boldsymbol{\mu}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})
$$
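The sampling step can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the encoder itself: the values of `mu` and `log_var` are hypothetical stand-ins for encoder outputs, and parameterizing the variance as a log-variance is a common convention assumed here rather than something the text specifies.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)    # noise does not depend on the encoder outputs
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)               # seeded so the draw is repeatable
mu = [0.5, -1.0, 2.0]                # hypothetical encoder mean
log_var = [-2.0, 0.0, -20.0]         # hypothetical encoder log-variance
z = reparameterize(mu, log_var, rng)
print(z)
```

Because the randomness lives entirely in `eps`, the sample is a deterministic function of `mu` and `log_var`, which is what lets gradients flow back to the encoder in an autodiff framework. The last dimension has a very small variance, so its sample stays close to its mean.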

Geometry and Musical Semantics

Well-trained latent spaces tend to show:

  • Timbre neighborhoods (similar instrument tone clusters)
  • Style manifolds (genre and production traits)
  • Continuous controls (energy, density, brightness, tension)

These structures make interpolation and editing possible without explicit symbolic rules.

Interpolation

Linear path between two points:

$$
\mathbf{z}_t=(1-t)\mathbf{z}_A+t\mathbf{z}_B,\quad t\in[0,1]
$$
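A minimal sketch of the linear path, using two hypothetical 2-D latent vectors chosen only to make the arithmetic easy to follow. It also tracks the norm along the path, which dips in the middle when the endpoints point in different directions.

```python
import math

def lerp(z_a, z_b, t):
    """Linear interpolation: (1 - t) * z_a + t * z_b, elementwise."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def norm(z):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in z))

z_a = [1.0, 0.0]   # hypothetical latent for sound A
z_b = [0.0, 1.0]   # hypothetical latent for sound B

# Five evenly spaced points from t = 0 to t = 1.
path = [lerp(z_a, z_b, i / 4) for i in range(5)]
norms = [norm(z) for z in path]
print(norms)
```

For these orthogonal unit vectors the midpoint is `[0.5, 0.5]`, whose norm is about 0.707, so the path cuts through a lower-norm region than either endpoint.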

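Spherical interpolation (slerp) moves along the arc between the two points. It preserves the norm when both endpoints have the same norm, and it often sounds smoother: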

$$
\text{slerp}(\mathbf{z}_A,\mathbf{z}_B;t)=\frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{z}_A+\frac{\sin(t\theta)}{\sin\theta}\mathbf{z}_B
$$

where

$$
\theta=\arccos\left(\frac{\mathbf{z}_A\cdot\mathbf{z}_B}{\|\mathbf{z}_A\|\|\mathbf{z}_B\|}\right)
$$
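The slerp formula can be sketched directly from the two equations above. Two details here are assumptions added for numerical safety and are not part of the formula: the cosine is clamped to $[-1, 1]$ before `acos`, and the function falls back to linear interpolation when $\sin\theta$ is close to zero, since the formula divides by it. The sketch also assumes both vectors are nonzero.

```python
import math

def slerp(z_a, z_b, t, eps=1e-8):
    """Spherical interpolation between two nonzero vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly parallel or antiparallel: the formula is ill-conditioned,
        # so use the linear path instead (a fallback chosen for this sketch).
        return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]
    w_a = math.sin((1.0 - t) * theta) / sin_theta
    w_b = math.sin(t * theta) / sin_theta
    return [w_a * a + w_b * b for a, b in zip(z_a, z_b)]

z_a = [1.0, 0.0]   # hypothetical unit-norm latent
z_b = [0.0, 1.0]   # hypothetical unit-norm latent
mid = slerp(z_a, z_b, 0.5)
mid_norm = math.sqrt(sum(v * v for v in mid))
print(mid, mid_norm)
```

With these unit-norm endpoints the midpoint keeps a norm of approximately 1, in contrast to the linear midpoint `[0.5, 0.5]`. For antiparallel vectors the arc is not unique, so the linear fallback is only one possible choice.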

Training Objective (ELBO)

$$
\mathcal{L}_{\text{ELBO}}=\mathbb{E}_{q_\phi(\mathbf{z}|x)}[\log p_\theta(x|\mathbf{z})]-D_{\text{KL}}\left(q_\phi(\mathbf{z}|x)\,\|\,p(\mathbf{z})\right)
$$

  • Reconstruction term preserves musical detail
  • KL term regularizes the space for stable sampling and interpolation
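
To make the two terms concrete, here is a hedged sketch of a single-sample ELBO estimate. It rests on assumptions the text does not state: the prior is taken to be $p(\mathbf{z})=\mathcal{N}(0,\mathbf{I})$, which gives the KL term the closed form $\tfrac{1}{2}\sum_i(\mu_i^2+\sigma_i^2-\log\sigma_i^2-1)$, and the decoder likelihood is taken to be a unit-variance Gaussian, so the reconstruction term reduces to a squared error plus a constant. The argument `x_hat` is a hypothetical decoder output for one sampled latent vector. The ELBO is maximized, so in practice its negative is used as the loss.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I): unit-variance Gaussian decoder (an assumption)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err - 0.5 * len(x) * math.log(2.0 * math.pi)

def elbo_estimate(x, x_hat, mu, log_var):
    """Reconstruction term minus KL term, using one decoded sample x_hat."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)

x = [0.2, -0.1, 0.4]       # hypothetical input features
x_hat = [0.25, -0.05, 0.3] # hypothetical reconstruction
mu = [1.0, 0.0]            # hypothetical posterior mean
log_var = [0.0, 0.0]       # hypothetical posterior log-variance (sigma = 1)

print(kl_to_standard_normal(mu, log_var))
print(elbo_estimate(x, x_hat, mu, log_var))
```

When the posterior equals the prior (`mu` all zeros, `log_var` all zeros) the KL term is zero, and moving a mean away from zero raises it, which is the pressure that keeps encoded points in a region that can be sampled and interpolated.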