
Diffusion Models for Audio

Diffusion models synthesize audio by learning to reverse a controlled noise process. They are widely used for high-fidelity text-to-audio generation.

Forward Process

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)$$

Closed-form sampling from clean data $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$.
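The closed-form expression above means any noise level can be sampled directly from clean data, without simulating the chain step by step. A minimal sketch, assuming a linear $\beta_t$ schedule (the schedule values here are illustrative, not from the text):

```python
import numpy as np

# Illustrative linear beta schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)  # e.g. one second of 16 kHz audio
xt, eps = q_sample(x0, t=500, rng=rng)
```

As $t$ grows, $\bar{\alpha}_t$ shrinks toward zero and $x_t$ approaches pure Gaussian noise.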

Reverse Process

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \boldsymbol{\mu}_\theta(x_t,t),\ \sigma_t^2\mathbf{I}\right)$$

A U-Net-like denoiser predicts the noise $\epsilon_\theta(x_t, t, c)$, optionally conditioned on text or other controls $c$.
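In the DDPM parameterization, the mean $\boldsymbol{\mu}_\theta$ is computed from the predicted noise, and one reverse step adds fresh noise scaled by $\sigma_t$. A sketch of a single ancestral step, assuming the same illustrative linear schedule as above and $\sigma_t^2 = \beta_t$ (one common choice):

```python
import numpy as np

# Same illustrative linear schedule as in the forward-process sketch.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample_step(eps_theta, xt, t, rng):
    """One reverse step x_t -> x_{t-1}: build mu_theta from predicted noise, add sigma_t * z."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_theta) / np.sqrt(alphas[t])  # mu_theta(x_t, t)
    if t == 0:
        return mean  # no noise is added on the final step
    sigma = np.sqrt(betas[t])  # sigma_t^2 = beta_t is one standard choice
    return mean + sigma * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
xt = rng.standard_normal(16000)
# A real sampler would call the trained denoiser here; zeros are a placeholder.
x_prev = p_sample_step(np.zeros_like(xt), xt, t=T - 1, rng=rng)
```

Sampling runs this step from $t = T-1$ down to $t = 0$, calling the denoiser once per step.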

Training Objective

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\|\epsilon - \epsilon_\theta(x_t,t,c)\|^2\right]$$

This objective is simple, stable, and effective for audio domains.
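One Monte Carlo sample of this objective is cheap to compute: draw a random timestep, noise the clean example in closed form, and regress the denoiser's output onto the noise. A minimal sketch with a stand-in denoiser (a real system would use a conditioned U-Net):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)

def toy_eps_theta(xt, t):
    """Placeholder denoiser; stands in for a conditioned U-Net."""
    return np.zeros_like(xt)

def simple_loss(x0, rng):
    """One Monte Carlo sample of L_simple for a single clean example x0."""
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - toy_eps_theta(xt, t)) ** 2)

rng = np.random.default_rng(0)
loss = simple_loss(rng.standard_normal(16000), rng)
```

Training averages this quantity over minibatches and optimizes it with standard gradient descent.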

Classifier-Free Guidance

$$\hat{\epsilon}_\theta = (1+w)\,\epsilon_\theta(x_t,t,c) - w\,\epsilon_\theta(x_t,t,\varnothing)$$

The guidance scale $w$ trades off prompt adherence against diversity.
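The guided prediction is a simple extrapolation from the unconditional output toward the conditional one, applied at every sampling step. A minimal sketch:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# w = 0 recovers the purely conditional prediction; larger w pushes harder toward the prompt.
e_hat = guided_eps(np.ones(4), np.zeros(4), w=2.0)
```

In practice the unconditional branch is obtained by training with the conditioning $c$ randomly dropped (replaced by $\varnothing$) some fraction of the time, so one model serves both terms.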

Latent Diffusion for Efficiency

Many systems diffuse in compressed latent space:

  1. Encode the waveform or spectrogram to a latent $\mathbf{z}_0$
  2. Run diffusion on $\mathbf{z}$ instead of raw audio
  3. Decode the denoised latent back to a waveform

This reduces memory and compute cost while preserving quality.
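The three-step pipeline above can be sketched end to end with stand-in components; the encoder, decoder, sampler, and all shapes here are hypothetical placeholders, not a real neural codec:

```python
import numpy as np

def encode(wave):
    """Stand-in encoder: compress 250 samples into one latent dimension."""
    return wave.reshape(-1, 250).mean(axis=1)

def decode(z):
    """Stand-in decoder: expand each latent dimension back to 250 samples."""
    return np.repeat(z, 250)

def denoise(z_t):
    """Stand-in for the full diffusion sampling loop over latents."""
    return z_t  # identity placeholder

wave = np.zeros(16000)          # one second at 16 kHz
z0 = encode(wave)               # step 1: waveform -> latent (64 dims here)
z_hat = denoise(z0)             # step 2: diffusion in latent space
out = decode(z_hat)             # step 3: latent -> waveform
```

Because the sampler runs over a 64-dimensional latent instead of 16,000 samples per second, each denoiser call is far cheaper, which is the source of the memory and compute savings.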

Engineering Notes for Music Generation

Practical systems combine diffusion with:

  • text encoders for prompt conditioning
  • temporal control tokens (intro, drop, bridge, outro)
  • post-processing (loudness normalization, limiting, optional stem mixing)

The final quality depends on data curation, conditioning fidelity, and scheduler design as much as model size.