Stable Audio

Stable Audio is Stability AI's text-to-music system, notable for being one of the first to apply latent diffusion (the same paradigm behind Stable Diffusion for images) to long-form music generation.

Architecture

Stable Audio uses a latent diffusion model (LDM) instead of autoregressive token prediction:

Text Prompt ──▶ CLAP/T5 Encoder ──▶ Conditioning
                                         │
                                         ▼
Noise ──▶ ┌─────────────────────┐        │
          │   U-Net Denoiser    │◀───────┘
          │   (latent space)    │
          └──────────┬──────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │     VAE Decoder     │──▶ Waveform
          └─────────────────────┘

Variational Autoencoder (VAE)

A convolutional VAE compresses audio to a lower-dimensional latent representation:

  • Encoder: waveform $x \rightarrow$ latent $\mathbf{z}_0 \in \mathbb{R}^{C \times T'}$
  • Decoder: latent $\mathbf{z}_0 \rightarrow$ waveform $\hat{x}$

The VAE is trained with reconstruction + adversarial losses to preserve audio quality at high compression ratios.
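To make the shape $C \times T'$ concrete, the following is a minimal sketch of the arithmetic, assuming the encoder downsamples time by a fixed factor. The default values (1024× temporal downsampling, 64 latent channels, stereo 44.1 kHz input) and the function names are illustrative assumptions, not figures stated in this page.

```python
def latent_shape(duration_s, sample_rate=44_100, downsample=1024, latent_channels=64):
    """Return (C, T') for a clip, assuming a fixed temporal downsampling factor."""
    num_samples = int(duration_s * sample_rate)
    latent_frames = -(-num_samples // downsample)  # ceiling division
    return latent_channels, latent_frames


def compression_ratio(audio_channels, num_samples, latent_channels, latent_frames):
    """Ratio of raw audio values to latent values."""
    return (audio_channels * num_samples) / (latent_channels * latent_frames)


# Hypothetical example: a 30-second stereo clip.
C, T_prime = latent_shape(30.0)
ratio = compression_ratio(2, int(30.0 * 44_100), C, T_prime)
print(C, T_prime, round(ratio, 1))
```

Under these assumed settings the diffusion model sees a sequence of roughly a thousand latent frames instead of more than a million waveform samples per channel.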

Diffusion in Latent Space

Instead of diffusing raw audio or spectrograms, diffusion operates on the compressed latent:

$$\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$$

$$\mathcal{L} = \mathbb{E}_{t, \mathbf{z}_0, \boldsymbol{\epsilon}} \left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})\|^2\right]$$

This dramatically reduces compute and memory compared to waveform-level diffusion.
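The forward noising step and the noise-prediction loss above can be sketched in NumPy as follows. The linear beta schedule, the 1000-step count, and the latent shape are assumptions chosen for illustration; this follows the equations in this section rather than any released training code.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps


def epsilon_loss(eps, eps_pred):
    """Mean squared error between true and predicted noise.

    The mean differs from the squared norm only by a constant factor.
    """
    return float(np.mean((eps - eps_pred) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((64, 128))  # stand-in for a VAE latent of shape (C, T')
z_t, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# A perfect denoiser would return eps; zeros stand in for an untrained one.
print(epsilon_loss(eps, eps), epsilon_loss(eps, np.zeros_like(eps)))
```

In training, `eps_pred` would come from the denoiser evaluated on `z_t`, the timestep, and the conditioning.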

U-Net Denoiser

The denoiser is a 1D U-Net adapted for audio latents:

  • Skip connections at multiple scales
  • Cross-attention layers for text conditioning
  • Timestep embedding via sinusoidal + MLP
  • Operates on the temporal dimension of the latent
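As an illustration of the cross-attention item in the list above, here is a minimal single-head sketch in NumPy in which latent frames act as queries and text token embeddings supply keys and values. All dimensions and the random projection weights are hypothetical; a real layer would use learned weights, multiple heads, and a residual connection.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latent, text, w_q, w_k, w_v):
    """latent: (T', D) latent frames; text: (N, D_text) token embeddings."""
    q = latent @ w_q                          # (T', d) queries from the latent
    k = text @ w_k                            # (N, d) keys from the text
    v = text @ w_v                            # (N, d) values from the text
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T', N) scaled dot products
    weights = softmax(scores, axis=-1)        # each frame attends over all tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
T_lat, D, N, D_text, d = 16, 32, 5, 24, 8    # hypothetical sizes
latent = rng.standard_normal((T_lat, D))
text = rng.standard_normal((N, D_text))
w_q = rng.standard_normal((D, d)) / np.sqrt(D)
w_k = rng.standard_normal((D_text, d)) / np.sqrt(D_text)
w_v = rng.standard_normal((D_text, d)) / np.sqrt(D_text)

out, weights = cross_attention(latent, text, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each latent frame receives its own weighted mix of text tokens, which is how the prompt can influence different positions in time differently.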

Conditioning

Stable Audio uses multiple conditioning signals:

  1. Text: CLAP and/or T5 embeddings via cross-attention
  2. Duration: explicit duration conditioning allows specifying output length
  3. Start time: allows generating from a specific position in a theoretical full track

Duration conditioning is a distinctive feature:

$$\mathbf{c}_{\text{dur}} = \text{MLP}(\text{sinusoidal\_embed}(d))$$

where $d$ is the target duration in seconds.
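A small NumPy sketch of the duration embedding formula above, assuming a standard transformer-style sinusoidal embedding followed by a two-layer ReLU MLP. The embedding width, hidden size, and random weights are hypothetical placeholders for learned parameters.

```python
import numpy as np


def sinusoidal_embed(value, dim=16, max_period=10_000.0):
    """Map a scalar to [sin(value * f_i), cos(value * f_i)] over dim/2 frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def mlp(x, w1, b1, w2, b2):
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
dim, hidden_dim, out_dim = 16, 32, 8       # hypothetical sizes
w1 = rng.standard_normal((dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, out_dim)) * 0.1
b2 = np.zeros(out_dim)

# c_dur for a requested duration of 45 seconds.
c_dur = mlp(sinusoidal_embed(45.0, dim), w1, b1, w2, b2)
print(c_dur.shape)
```

The sinusoidal features give nearby durations similar but distinguishable inputs, and the MLP projects them to whatever width the denoiser expects.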

Stable Audio 2.0

The second version introduced significant improvements:

  • Longer generation: up to 3 minutes (vs. 90 seconds in v1)
  • Stereo output: native stereo generation
  • Audio-to-audio: conditioning on input audio for style transfer and variation
  • High sample rate: 44.1 kHz output
  • Improved structure: better song-level coherence via longer context training

Timing Conditioning

Stable Audio 2.0 uses a timing-aware architecture that encodes:

$$\mathbf{c}_{\text{timing}} = [\text{embed}(t_{\text{start}}); \text{embed}(t_{\text{total}})]$$

This allows the model to understand where it is in the overall track, improving structural coherence.
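The concatenation in the timing formula can be sketched as below. Here `embed` is taken to be a lookup table indexed by whole seconds. That choice, the table size, and the embedding width are assumptions of this sketch, and the random tables stand in for learned parameters.

```python
import numpy as np

MAX_SECONDS = 512  # hypothetical table size (longest supported time in seconds)
EMBED_DIM = 8      # hypothetical embedding width

rng = np.random.default_rng(0)
start_table = rng.standard_normal((MAX_SECONDS, EMBED_DIM))  # stand-in for learned weights
total_table = rng.standard_normal((MAX_SECONDS, EMBED_DIM))


def embed_seconds(table, seconds):
    """Look up the embedding row for a time rounded to the nearest second."""
    index = int(np.clip(round(seconds), 0, len(table) - 1))
    return table[index]


def timing_conditioning(t_start, t_total):
    """Concatenate the start-time and total-length embeddings: [a; b]."""
    return np.concatenate([
        embed_seconds(start_table, t_start),
        embed_seconds(total_table, t_total),
    ])


# A window that begins 60 s into a 180 s track.
c_timing = timing_conditioning(60.0, 180.0)
print(c_timing.shape)
```

Two windows cut from the same track share the `t_total` half of the vector and differ only in the `t_start` half, which is what lets the model tell an intro from a later section.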

Training Data

Stable Audio models are trained on licensed music from AudioSparx and other licensed catalogs. Stability AI emphasized using only licensed training data.

Comparison with Autoregressive Approaches

| Aspect | Stable Audio (Diffusion) | MusicGen (Autoregressive) |
| --- | --- | --- |
| Generation paradigm | Iterative denoising | Token-by-token |
| Duration flexibility | Native duration control | Fixed by sequence length |
| Inference speed | Faster for long audio | Slower for long sequences |
| Edit/inpainting | Naturally supported | Requires masking tricks |
| Quality character | Smooth, diffusion aesthetic | Sharp, token-level detail |

Inference

During inference, the reverse diffusion process generates clean latents from noise:

$$\mathbf{z}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{z}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})\right) + \sigma_t \mathbf{n}$$

where $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is fresh Gaussian noise and $\sigma_t$ sets how much of it is added at step $t$.
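One way to sketch the reverse update as a sampling loop in NumPy is shown below. The choice $\sigma_t^2 = 1 - \alpha_t$, the 50-step linear schedule, and the zero-output `dummy_denoiser` are assumptions for illustration; a trained network conditioned on text and timing would take the denoiser's place.

```python
import numpy as np


def reverse_step(z_t, t, eps_pred, alphas, alpha_bar, rng):
    """One ancestral sampling step from z_t to z_{t-1}."""
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t])
    mean = (z_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    sigma_t = np.sqrt(1.0 - alphas[t])  # assumed choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(z_t.shape)


def sample(denoiser, shape, alphas, alpha_bar, rng):
    """Start from pure noise and apply the reverse step for every timestep."""
    z = rng.standard_normal(shape)
    for t in reversed(range(len(alphas))):
        z = reverse_step(z, t, denoiser(z, t), alphas, alpha_bar, rng)
    return z


def dummy_denoiser(z, t):
    """Placeholder for eps_theta(z_t, t, c); always predicts zero noise."""
    return np.zeros_like(z)


betas = np.linspace(1e-4, 0.02, 50)  # short assumed schedule for the demo
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
rng = np.random.default_rng(0)

latents = sample(dummy_denoiser, (64, 32), alphas, alpha_bar, rng)
print(latents.shape)
```

The resulting latents would then be passed through the VAE decoder to obtain a waveform.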

Classifier-free guidance scales the conditioning effect:

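$$\hat{\boldsymbol{\epsilon}} = (1+w)\,\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) - w \cdot \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing)$$

where $w$ is the guidance scale and $\varnothing$ denotes empty (unconditional) conditioning. Setting $w = 0$ recovers the plain conditional prediction.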
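The guidance formula amounts to a one-line combination of two denoiser outputs. A minimal NumPy sketch follows, with made-up two-element predictions standing in for real network outputs:

```python
import numpy as np


def guided_epsilon(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond


# Toy predictions (hypothetical values, not real model outputs).
eps_cond = np.array([1.0, 2.0])    # denoiser output with the text prompt
eps_uncond = np.array([0.5, 1.0])  # denoiser output with empty conditioning

print(guided_epsilon(eps_cond, eps_uncond, 0.0))  # equals eps_cond
print(guided_epsilon(eps_cond, eps_uncond, 2.0))  # pushed further away from eps_uncond
```

The same expression can be read as `eps_uncond + (1 + w) * (eps_cond - eps_uncond)`, so larger `w` extrapolates further along the direction the prompt suggests. Each sampling step needs two denoiser evaluations, one with and one without conditioning.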

Open Source

Stable Audio Open was released as an open-weight model, enabling:

  • Community fine-tuning
  • Research experimentation
  • Custom deployment

Engineering Significance

Stable Audio proved that latent diffusion is viable for long-form music generation, bringing the efficiency and flexibility of the Stable Diffusion paradigm to the audio domain. Duration conditioning and timing-aware architectures were particularly influential innovations.