Stable Audio
Stable Audio is Stability AI's text-to-music system, notable for being one of the first to apply latent diffusion (the same paradigm behind Stable Diffusion for images) to long-form music generation.
Architecture
Stable Audio uses a latent diffusion model (LDM) instead of autoregressive token prediction:
Text Prompt ──▶ CLAP/T5 Encoder ──▶ Conditioning
                                         │
Noise ──▶ ┌─────────────────────┐        │
          │   U-Net Denoiser    │◀───────┘
          │   (latent space)    │
          └──────────┬──────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │     VAE Decoder     │──▶ Waveform
          └─────────────────────┘
Variational Autoencoder (VAE)
A convolutional VAE compresses audio to a lower-dimensional latent representation:
- Encoder: waveform → latent
- Decoder: latent → waveform
The VAE is trained with reconstruction + adversarial losses to preserve audio quality at high compression ratios.
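To make the compression concrete, the sketch below computes the latent shape and the overall compression ratio for a stereo clip. The downsampling factor (1024) and latent channel count (64) are illustrative assumptions, not values stated in this article, and `latent_shape` and `compression_ratio` are hypothetical helper names.

```python
def latent_shape(num_samples, downsample_factor, latent_channels):
    """Return (latent_channels, latent_length) for a waveform of num_samples."""
    # Ceiling division: a partial final frame still yields one latent step
    # (this assumes the encoder pads the input).
    latent_length = -(-num_samples // downsample_factor)
    return latent_channels, latent_length


def compression_ratio(audio_channels, num_samples, downsample_factor, latent_channels):
    """Ratio of waveform values to latent values."""
    channels, length = latent_shape(num_samples, downsample_factor, latent_channels)
    return (audio_channels * num_samples) / (channels * length)


SAMPLE_RATE = 44_100       # Hz
AUDIO_CHANNELS = 2         # stereo
DOWNSAMPLE_FACTOR = 1024   # assumed temporal downsampling of the encoder
LATENT_CHANNELS = 64       # assumed latent width

num_samples = SAMPLE_RATE * 30  # a 30-second clip
shape = latent_shape(num_samples, DOWNSAMPLE_FACTOR, LATENT_CHANNELS)
ratio = compression_ratio(AUDIO_CHANNELS, num_samples, DOWNSAMPLE_FACTOR, LATENT_CHANNELS)
print(shape, round(ratio, 1))  # prints (64, 1292) 32.0
```

Under these assumptions the denoiser sees about 1.3 thousand time steps instead of 1.3 million samples per channel.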
Diffusion in Latent Space
Instead of diffusing raw audio or spectrograms, diffusion operates on the compressed latent produced by the VAE encoder.
This dramatically reduces compute and memory compared to waveform-level diffusion.
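As a minimal sketch of what "diffusing the latent" means, the code below mixes a clean latent with Gaussian noise at a chosen noise level. The cosine schedule and the function names `cosine_alpha_sigma` and `add_noise` are assumptions for illustration; this article does not specify the schedule Stable Audio uses.

```python
import math
import random


def cosine_alpha_sigma(t):
    """Map t in [0, 1] to (alpha, sigma) with alpha^2 + sigma^2 = 1.

    t = 0 keeps the latent intact, t = 1 is pure noise.
    """
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(latent, t, rng):
    """Noise a latent (list of channels, each a list of floats) to level t.

    Returns the noisy latent and the noise that was mixed in.
    """
    alpha, sigma = cosine_alpha_sigma(t)
    noise = [[rng.gauss(0.0, 1.0) for _ in channel] for channel in latent]
    noisy = [
        [alpha * z + sigma * e for z, e in zip(channel, noise_channel)]
        for channel, noise_channel in zip(latent, noise)
    ]
    return noisy, noise


rng = random.Random(0)
clean = [[0.5, -0.25, 1.0], [0.0, 2.0, -1.5]]  # toy latent: 2 channels, 3 steps
half_noisy, _ = add_noise(clean, 0.5, rng)
print(half_noisy)
```

During training the denoiser learns to undo this mixing; because the latent is far shorter than the waveform, every training and sampling step touches far fewer values.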
U-Net Denoiser
The denoiser is a 1D U-Net adapted for audio latents:
- Skip connections at multiple scales
- Cross-attention layers for text conditioning
- Timestep embedding via sinusoidal features followed by an MLP
- Operates on the temporal dimension of the latent
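The timestep embedding in the list above can be sketched as follows: the scalar timestep is expanded into sine and cosine features at several frequencies, then passed through a small MLP. Dimensions, the SiLU activation, and the random weights are illustrative assumptions (a trained model would use learned weights), and all function names are hypothetical.

```python
import math
import random


def sinusoidal_embedding(t, dim, max_period=10_000.0):
    """Encode scalar t as [sin(t*f_0..), cos(t*f_0..)] with geometric frequencies."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def silu(x):
    return x / (1.0 + math.exp(-x))


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def timestep_mlp(features, w1, b1, w2, b2):
    """Two-layer MLP applied to the sinusoidal features."""
    hidden = [silu(h) for h in linear(features, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
w1, b1 = random_matrix(16, 8, rng), [0.0] * 16   # 8 features -> 16 hidden units
w2, b2 = random_matrix(16, 16, rng), [0.0] * 16  # 16 hidden -> 16 outputs

features = sinusoidal_embedding(0.25, 8)
timestep_vector = timestep_mlp(features, w1, b1, w2, b2)
print(len(features), len(timestep_vector))
```

The resulting vector is typically added to or used to modulate the feature maps in each U-Net block, so every layer knows the current noise level.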
Conditioning
Stable Audio uses multiple conditioning signals:
- Text: CLAP and/or T5 embeddings via cross-attention
- Duration: explicit duration conditioning allows specifying output length
- Start time: allows generating from a specific position in a theoretical full track
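Duration conditioning is a distinctive feature: the target duration in seconds is supplied to the model as an explicit conditioning input, so the length of the generated music can be requested directly.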
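One plausible way to feed these signals to a cross-attention denoiser is to turn the start time and the duration into embedding vectors and append them to the text embedding sequence. The sketch below assumes a lookup table with one vector per whole second; the table size, embedding width, and all names are hypothetical and not taken from this article.

```python
import random

EMBED_DIM = 4      # toy width; real models use far wider embeddings
MAX_SECONDS = 512  # assumed upper bound on representable seconds


def make_seconds_table(rng):
    """Stand-in for a learned table holding one vector per whole second."""
    return [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(MAX_SECONDS + 1)]


def build_conditioning(text_tokens, seconds_start, seconds_total, start_table, total_table):
    """Append start-time and duration embeddings to the text token embeddings."""
    if not 0 <= seconds_start <= seconds_total <= MAX_SECONDS:
        raise ValueError("expected 0 <= seconds_start <= seconds_total <= MAX_SECONDS")
    return text_tokens + [start_table[seconds_start], total_table[seconds_total]]


rng = random.Random(0)
start_table = make_seconds_table(rng)
total_table = make_seconds_table(rng)

# Placeholder for the text encoder output: 3 tokens of width EMBED_DIM.
text_tokens = [[0.1] * EMBED_DIM, [0.2] * EMBED_DIM, [0.3] * EMBED_DIM]

conditioning = build_conditioning(text_tokens, 0, 30, start_table, total_table)
print(len(conditioning))  # 3 text tokens + 2 timing tokens
```

The cross-attention layers then attend over text and timing tokens alike, so no separate pathway is needed for duration.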
Stable Audio 2.0
The second version introduced significant improvements:
- Longer generation: up to 3 minutes (vs. 90 seconds in v1)
- Stereo output: native stereo generation
- Audio-to-audio: conditioning on input audio for style transfer and variation
- High sample rate: 44.1 kHz output
- Improved structure: better song-level coherence via longer context training
Timing Conditioning
Stable Audio 2.0 uses a timing-aware architecture that encodes the start time of the generated segment and the total duration of the track.
This allows the model to understand where it is in the overall track, improving structural coherence.
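The sketch below illustrates where such timing values could come from: a training crop taken from a longer track records its own start time and the track's total length, and at inference the requested duration decides how much of the fixed-size output window to keep. The helper names and the 95-second window are assumptions for illustration only.

```python
def timing_for_crop(track_seconds, crop_start_seconds):
    """Timing values for a training crop taken from a full track."""
    if track_seconds <= 0 or not 0 <= crop_start_seconds <= track_seconds:
        raise ValueError("crop must start inside a track of positive length")
    return {"seconds_start": crop_start_seconds, "seconds_total": track_seconds}


def relative_position(timing):
    """Fraction of the track that precedes the crop (0.0 = very beginning)."""
    return timing["seconds_start"] / timing["seconds_total"]


def samples_to_keep(seconds_total, window_seconds, sample_rate):
    """At inference, keep only the requested duration from the output window."""
    return int(min(seconds_total, window_seconds) * sample_rate)


# Training: a crop starting 45 s into a 180 s track.
timing = timing_for_crop(track_seconds=180, crop_start_seconds=45)
print(timing, relative_position(timing))

# Inference: request 30 s from a model whose window is assumed to be 95 s.
print(samples_to_keep(seconds_total=30, window_seconds=95, sample_rate=44_100))
```

Seeing many crops with different start positions is what lets the model associate, for example, an intro with a start time near zero and an ending with a start time close to the total duration.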
Training Data
Stable Audio models are trained on licensed music from AudioSparx and other licensed catalogs. Stability AI emphasized using only licensed training data.
Comparison with Autoregressive Approaches
| Aspect | Stable Audio (Diffusion) | MusicGen (Autoregressive) |
|---|---|---|
| Generation paradigm | Iterative denoising | Token-by-token |
| Duration flexibility | Native duration control | Fixed by sequence length |
| Inference speed | Faster for long audio | Slower for long sequences |
| Edit/inpainting | Naturally supported | Requires masking tricks |
| Quality character | Smooth, diffusion aesthetic | Sharp, token-level detail |
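The inference-speed row can be made concrete by counting sequential model calls. The sketch below is back-of-the-envelope arithmetic under assumed numbers (50 token frames per second for the autoregressive model, 100 sampler steps for diffusion); neither figure comes from this article.

```python
import math


def autoregressive_steps(duration_seconds, frames_per_second):
    """Sequential decoding steps: one model call per token frame."""
    return math.ceil(duration_seconds * frames_per_second)


def diffusion_steps(duration_seconds, sampler_steps):
    """Sequential denoising steps: set by the sampler, independent of duration."""
    return sampler_steps


FRAMES_PER_SECOND = 50  # assumed token frame rate
SAMPLER_STEPS = 100     # assumed number of denoising steps

for seconds in (30, 180):
    print(
        seconds,
        autoregressive_steps(seconds, FRAMES_PER_SECOND),
        diffusion_steps(seconds, SAMPLER_STEPS),
    )
```

This counts only sequential calls. Each diffusion step processes the whole latent sequence at once, so its per-step cost still grows with duration, but the work within a step parallelizes well on a GPU.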
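Inference
During inference, the reverse diffusion process starts from Gaussian noise in the latent space and iteratively denoises it into a clean latent, which the VAE decoder then converts to a waveform.
Classifier-free guidance scales the conditioning effect: the model's conditional and unconditional predictions are combined, with a guidance scale controlling how strongly the output follows the prompt.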
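The two ideas above can be sketched together in a toy sampler. It assumes a cosine noise schedule, a v-prediction parameterization, and a deterministic DDIM-style update; these are illustrative choices, not details given in this article. `toy_v_model` is a hypothetical stand-in for the trained denoiser that simply behaves as if the clean latent were `clean_guess`.

```python
import math
import random


def alpha_sigma(t):
    """Cosine schedule: t = 0 is clean, t = 1 is pure noise (assumed)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def toy_v_model(z, t, clean_guess):
    """Stand-in denoiser: the v-prediction consistent with `clean_guess`."""
    alpha, sigma = alpha_sigma(t)
    return [alpha * (zi - alpha * gi) / sigma - sigma * gi for zi, gi in zip(z, clean_guess)]


def sample(target, steps, cfg_scale, rng):
    """Denoise from pure noise to a clean latent in `steps` DDIM-style updates."""
    z = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    null_guess = [0.0] * len(target)           # "empty prompt" behaviour
    for i in range(steps):
        t = 1.0 - i / steps
        t_next = 1.0 - (i + 1) / steps
        v_cond = toy_v_model(z, t, target)        # prompt-conditioned prediction
        v_uncond = toy_v_model(z, t, null_guess)  # unconditional prediction
        v = cfg_combine(v_uncond, v_cond, cfg_scale)
        alpha, sigma = alpha_sigma(t)
        clean = [alpha * zi - sigma * vi for zi, vi in zip(z, v)]  # predicted clean latent
        noise = [sigma * zi + alpha * vi for zi, vi in zip(z, v)]  # predicted noise
        alpha_next, sigma_next = alpha_sigma(t_next)
        z = [alpha_next * c + sigma_next * n for c, n in zip(clean, noise)]
    return z


target = [0.5, -1.0, 2.0]  # the latent the "prompt" describes in this toy setup
print(sample(target, steps=10, cfg_scale=1.0, rng=random.Random(0)))
print(sample(target, steps=10, cfg_scale=2.0, rng=random.Random(0)))
```

With a guidance scale of 1 the sampler lands on the conditional target; with a scale of 2 this toy model overshoots to twice the target, which mirrors how large guidance scales exaggerate prompt-related features in practice.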
Open Source
Stable Audio Open was released as an open-weight model, enabling:
- Community fine-tuning
- Research experimentation
- Custom deployment
Engineering Significance
Stable Audio proved that latent diffusion is viable for long-form music generation, bringing the efficiency and flexibility of the Stable Diffusion paradigm to the audio domain. Duration conditioning and timing-aware architectures were particularly influential innovations.