
Diffusion Models for Audio

Diffusion models synthesize audio by learning to reverse a controlled noise process. They are widely used for high-fidelity text-to-audio generation.

Forward Process

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)$$

Closed-form sampling from clean data $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$.
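The closed-form expression above means any noise level can be sampled directly from clean data, without simulating the chain step by step. A minimal sketch, assuming a linear $\beta_t$ schedule (the schedule values here are illustrative, not from the text):

```python
import numpy as np

# Illustrative linear beta schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)  # e.g. one second of 16 kHz audio
xt, eps = q_sample(x0, t=500, rng=rng)
```

As $t$ grows, $\bar{\alpha}_t$ shrinks toward zero and $x_t$ approaches pure Gaussian noise.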

Reverse Process

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \boldsymbol{\mu}_\theta(x_t,t),\ \sigma_t^2\mathbf{I}\right)$$

A U-Net-like denoiser predicts the noise $\epsilon_\theta(x_t, t, c)$, optionally conditioned on text or other controls $c$.
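In the DDPM parameterization, the mean $\boldsymbol{\mu}_\theta$ is computed from the predicted noise, and one reverse step adds fresh noise scaled by $\sigma_t$. A sketch of a single ancestral step, assuming the same illustrative linear schedule as above and $\sigma_t^2 = \beta_t$ (one common choice):

```python
import numpy as np

# Same illustrative linear schedule as in the forward-process sketch.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_step(eps_theta, xt, t, rng):
    """One reverse step x_t -> x_{t-1}: build mu_theta from predicted noise, add sigma_t * z."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_theta) / np.sqrt(alphas[t])  # mu_theta(x_t, t)
    if t == 0:
        return mean  # no noise is added on the final step
    sigma = np.sqrt(betas[t])  # sigma_t^2 = beta_t is one standard choice
    return mean + sigma * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
xt = rng.standard_normal(16000)
# A real sampler would call the trained denoiser here; zeros are a placeholder.
x_prev = p_sample_step(np.zeros_like(xt), xt, t=T - 1, rng=rng)
```

Sampling runs this step from $t = T-1$ down to $t = 0$, calling the denoiser once per step.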

Training Objective

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\|\epsilon - \epsilon_\theta(x_t,t,c)\|^2\right]$$

This objective is simple, stable, and effective for audio domains.
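One Monte Carlo sample of this objective is cheap to compute: draw a random timestep, noise the clean example in closed form, and regress the denoiser's output onto the noise. A minimal sketch with a stand-in denoiser (a real system would use a conditioned U-Net):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)

def toy_eps_theta(xt, t):
    """Placeholder denoiser; stands in for a conditioned U-Net."""
    return np.zeros_like(xt)

def simple_loss(x0, rng):
    """One Monte Carlo sample of L_simple for a single clean example x0."""
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - toy_eps_theta(xt, t)) ** 2)

rng = np.random.default_rng(0)
loss = simple_loss(rng.standard_normal(16000), rng)
```

Training averages this quantity over minibatches and optimizes it with standard gradient descent.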

Classifier-Free Guidance

$$\hat{\epsilon}_\theta = (1+w)\,\epsilon_\theta(x_t,t,c) - w\,\epsilon_\theta(x_t,t,\varnothing)$$

The guidance scale $w$ trades off prompt adherence against diversity.
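The guided prediction is a simple extrapolation from the unconditional output toward the conditional one, applied at every sampling step. A minimal sketch:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# w = 0 recovers the purely conditional prediction; larger w pushes harder toward the prompt.
e_hat = guided_eps(np.ones(4), np.zeros(4), w=2.0)
```

In practice the unconditional branch is obtained by training with the conditioning $c$ randomly dropped (replaced by $\varnothing$) some fraction of the time, so one model serves both terms.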

Latent Diffusion for Efficiency

Many systems diffuse in compressed latent space:

  1. Encode the waveform or spectrogram to a latent $\mathbf{z}_0$
  2. Run diffusion on $\mathbf{z}$ instead of raw audio
  3. Decode the denoised latent back to a waveform

This reduces memory and compute cost while preserving quality.
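The three-step pipeline above can be sketched end to end with stand-in components; the encoder, decoder, sampler, and all shapes here are hypothetical placeholders, not a real neural codec:

```python
import numpy as np

def encode(wave):
    """Stand-in encoder: compress 250 samples into one latent dimension."""
    return wave.reshape(-1, 250).mean(axis=1)

def decode(z):
    """Stand-in decoder: expand each latent dimension back to 250 samples."""
    return np.repeat(z, 250)

def denoise(z_t):
    """Stand-in for the full diffusion sampling loop over latents."""
    return z_t  # identity placeholder

wave = np.zeros(16000)          # one second at 16 kHz
z0 = encode(wave)               # step 1: waveform -> latent (64 dims here)
z_hat = denoise(z0)             # step 2: diffusion in latent space
out = decode(z_hat)             # step 3: latent -> waveform
```

Because the sampler runs over a 64-dimensional latent instead of 16,000 samples per second, each denoiser call is far cheaper, which is the source of the memory and compute savings.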

Engineering Notes for Music Generation

Practical systems combine diffusion with:

  • text encoders for prompt conditioning
  • temporal control tokens (intro, drop, bridge, outro)
  • post-processing (loudness normalization, limiting, optional stem mixing)

The final quality depends on data curation, conditioning fidelity, and scheduler design as much as model size.