Stable Audio
Stable Audio is Stability AI's text-to-music system, notable for being one of the first to apply latent diffusion — the same paradigm behind Stable Diffusion for images — to long-form music generation.
Architecture
Stable Audio uses a latent diffusion model (LDM) instead of autoregressive token prediction:
Text Prompt ──▶ CLAP/T5 Encoder ──▶ Conditioning
│
▼
Noise ──▶ ┌─────────────────────┐ │
│ U-Net Denoiser │◀──────┘
│ (latent space) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ VAE Decoder │──▶ Waveform
└─────────────────────┘
Variational Autoencoder (VAE)
A convolutional VAE compresses audio to a lower-dimensional latent representation:
- Encoder: waveform latent
- Decoder: latent waveform
The VAE is trained with reconstruction + adversarial losses to preserve audio quality at high compression ratios.
Diffusion in Latent Space
Instead of diffusing raw audio or spectrograms, diffusion operates on the compressed latent:
This dramatically reduces compute and memory compared to waveform-level diffusion.
U-Net Denoiser
The denoiser is a 1D U-Net adapted for audio latents:
- Skip connections at multiple scales
- Cross-attention layers for text conditioning
- Timestep embedding via sinusoidal + MLP
- Operates on the temporal dimension of the latent
Conditioning
Stable Audio uses multiple conditioning signals:
- Text: CLAP and/or T5 embeddings via cross-attention
- Duration: explicit duration conditioning allows specifying output length
- Start time: allows generating from a specific position in a theoretical full track
Duration conditioning is a distinctive feature:
where is the target duration in seconds.
Stable Audio 2.0
The second version introduced significant improvements:
- Longer generation: up to 3 minutes (vs. 90 seconds in v1)
- Stereo output: native stereo generation
- Audio-to-audio: conditioning on input audio for style transfer and variation
- Higher sample rate: 44.1 kHz output
- Improved structure: better song-level coherence via longer context training
Timing Conditioning
Stable Audio 2.0 uses a timing-aware architecture that encodes:
This allows the model to understand where it is in the overall track, improving structural coherence.
Training Data
Stable Audio models are trained on licensed music from AudioSparx and other licensed catalogs. Stability AI emphasized using only licensed training data.
Comparison with Autoregressive Approaches
| Aspect | Stable Audio (Diffusion) | MusicGen (Autoregressive) |
|---|---|---|
| Generation paradigm | Iterative denoising | Token-by-token |
| Duration flexibility | Native duration control | Fixed by sequence length |
| Inference speed | Faster for long audio | Slower for long sequences |
| Edit/inpainting | Naturally supported | Requires masking tricks |
| Quality character | Smooth, diffusion aesthetic | Sharp, token-level detail |
Inference
During inference, the reverse diffusion process generates clean latents from noise:
Classifier-free guidance scales the conditioning effect:
Open Source
Stable Audio Open was released as an open-weight model, enabling:
- Community fine-tuning
- Research experimentation
- Custom deployment
Engineering Significance
Stable Audio proved that latent diffusion is viable for long-form music generation, bringing the efficiency and flexibility of the Stable Diffusion paradigm to the audio domain. Duration conditioning and timing-aware architectures were particularly influential innovations.