Skip to main content

Stable Audio

Stable Audio is Stability AI's text-to-music system, notable for being one of the first to apply latent diffusion — the same paradigm behind Stable Diffusion for images — to long-form music generation.

Architecture

Stable Audio uses a latent diffusion model (LDM) instead of autoregressive token prediction:

Text Prompt ──▶ CLAP/T5 Encoder ──▶ Conditioning


Noise ──▶ ┌─────────────────────┐ │
│ U-Net Denoiser │◀──────┘
│ (latent space) │
└─────────┬───────────┘


┌─────────────────────┐
│ VAE Decoder │──▶ Waveform
└─────────────────────┘

Variational Autoencoder (VAE)

A convolutional VAE compresses audio to a lower-dimensional latent representation:

  • Encoder: waveform xx \rightarrow latent z0RC×T\mathbf{z}_0 \in \mathbb{R}^{C \times T'}
  • Decoder: latent z0\mathbf{z}_0 \rightarrow waveform x^\hat{x}

The VAE is trained with reconstruction + adversarial losses to preserve audio quality at high compression ratios.

Diffusion in Latent Space

Instead of diffusing raw audio or spectrograms, diffusion operates on the compressed latent:

zt=αˉtz0+1αˉtϵ\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon} L=Et,z0,ϵ[ϵϵθ(zt,t,c)2]\mathcal{L} = \mathbb{E}_{t, \mathbf{z}_0, \boldsymbol{\epsilon}} \left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})\|^2\right]

This dramatically reduces compute and memory compared to waveform-level diffusion.

U-Net Denoiser

The denoiser is a 1D U-Net adapted for audio latents:

  • Skip connections at multiple scales
  • Cross-attention layers for text conditioning
  • Timestep embedding via sinusoidal + MLP
  • Operates on the temporal dimension of the latent

Conditioning

Stable Audio uses multiple conditioning signals:

  1. Text: CLAP and/or T5 embeddings via cross-attention
  2. Duration: explicit duration conditioning allows specifying output length
  3. Start time: allows generating from a specific position in a theoretical full track

Duration conditioning is a distinctive feature:

cdur=MLP(sinusoidal_embed(d))\mathbf{c}_{\text{dur}} = \text{MLP}(\text{sinusoidal\_embed}(d))

where dd is the target duration in seconds.

Stable Audio 2.0

The second version introduced significant improvements:

  • Longer generation: up to 3 minutes (vs. 90 seconds in v1)
  • Stereo output: native stereo generation
  • Audio-to-audio: conditioning on input audio for style transfer and variation
  • Higher sample rate: 44.1 kHz output
  • Improved structure: better song-level coherence via longer context training

Timing Conditioning

Stable Audio 2.0 uses a timing-aware architecture that encodes:

ctiming=[embed(tstart);embed(ttotal)]\mathbf{c}_{\text{timing}} = [\text{embed}(t_{\text{start}}); \text{embed}(t_{\text{total}})]

This allows the model to understand where it is in the overall track, improving structural coherence.

Training Data

Stable Audio models are trained on licensed music from AudioSparx and other licensed catalogs. Stability AI emphasized using only licensed training data.

Comparison with Autoregressive Approaches

AspectStable Audio (Diffusion)MusicGen (Autoregressive)
Generation paradigmIterative denoisingToken-by-token
Duration flexibilityNative duration controlFixed by sequence length
Inference speedFaster for long audioSlower for long sequences
Edit/inpaintingNaturally supportedRequires masking tricks
Quality characterSmooth, diffusion aestheticSharp, token-level detail

Inference

During inference, the reverse diffusion process generates clean latents from noise:

zt1=1αt(zt1αt1αˉtϵθ(zt,t,c))+σtn\mathbf{z}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{z}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})\right) + \sigma_t \mathbf{n}

Classifier-free guidance scales the conditioning effect:

ϵ^=(1+w)ϵθ(zt,t,c)wϵθ(zt,t,)\hat{\boldsymbol{\epsilon}} = (1+w)\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) - w \cdot \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing)

Open Source

Stable Audio Open was released as an open-weight model, enabling:

  • Community fine-tuning
  • Research experimentation
  • Custom deployment

Engineering Significance

Stable Audio proved that latent diffusion is viable for long-form music generation, bringing the efficiency and flexibility of the Stable Diffusion paradigm to the audio domain. Duration conditioning and timing-aware architectures were particularly influential innovations.