Variational Autoencoders for Audio
Variational Autoencoders (VAEs) are generative models that learn a compressed, continuous latent space from which new audio can be sampled. They are foundational in music AI — serving as the compression backbone in latent diffusion systems and as standalone generative models.
Architecture
Input Audio x ──▶ Encoder q_φ(z|x) ──▶ z (latent) ──▶ Decoder p_θ(x|z) ──▶ Reconstructed x̂
                          │                 ▲
                          │ Reparameterization
                          │   z = μ + σ⊙ε  │
                          └────────────────┘
Encoder
Maps input audio to the parameters of a latent distribution:

q_φ(z|x) = N(z; μ_φ(x), diag(σ_φ²(x)))
For audio, the encoder is typically:
- 1D Convolutional: for waveform input (neural codec style)
- 2D Convolutional: for spectrogram input
- Transformer-based: for token sequences
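As a minimal sketch (PyTorch; layer widths, kernel sizes, and the `WaveEncoder` name are illustrative, not from any specific system), a strided 1D convolutional encoder that maps a waveform to per-frame μ and log σ² might look like:

```python
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """Strided 1D conv encoder: waveform -> (mu, log_var) per latent frame."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        )
        # Two 1x1-conv heads: mean and log-variance of the latent distribution
        self.to_mu = nn.Conv1d(128, latent_dim, kernel_size=1)
        self.to_log_var = nn.Conv1d(128, latent_dim, kernel_size=1)

    def forward(self, x):  # x: (batch, 1, samples)
        h = self.net(x)
        return self.to_mu(h), self.to_log_var(h)

enc = WaveEncoder()
mu, log_var = enc(torch.randn(2, 1, 1024))  # three stride-2 layers -> 8x downsampling
```

Each stride-2 layer halves the temporal resolution, so 1024 input samples become 128 latent frames here.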
Reparameterization Trick
To backpropagate through sampling, draw the latent as a deterministic transform of external noise:

z = μ + σ ⊙ ε,  ε ~ N(0, I)

This converts a stochastic sampling step into a deterministic function of learnable parameters (μ, σ) plus independent noise.
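A sketch of the trick in PyTorch, showing that gradients flow to μ and log σ² even though z is random:

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and log_var."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)  # the noise itself carries no gradient
    return mu + sigma * eps

mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()  # gradients reach mu and log_var through the deterministic path
```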
Decoder
Reconstructs audio from the latent vector:

x̂ = Decoder_θ(z), parameterizing p_θ(x|z)
Mirror architecture of the encoder with upsampling operations.
Training Objective
The VAE is trained by maximizing the Evidence Lower Bound (ELBO):

ELBO = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z))
Reconstruction Term
Measures how well the decoder reproduces the input. For audio, the simplest choice is sample-wise MSE on the waveform:

L_rec = ‖x − x̂‖²

In practice, multi-scale spectral losses work better than raw waveform MSE:

L_spec = Σ_s ‖ |STFT_s(x)| − |STFT_s(x̂)| ‖₁

computed over several FFT sizes s.
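A sketch of a multi-scale spectral loss in PyTorch (the FFT sizes, hop ratio, and equal weighting are illustrative choices, not a prescription):

```python
import torch

def multi_scale_spectral_loss(x, x_hat, fft_sizes=(512, 1024, 2048)):
    """Mean L1 distance between STFT magnitudes at several resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)
        spec = torch.stft(x, n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True).abs()
        spec_hat = torch.stft(x_hat, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()
        loss = loss + (spec - spec_hat).abs().mean()
    return loss / len(fft_sizes)

x = torch.randn(2, 8192)       # batch of waveforms
loss = multi_scale_spectral_loss(x, torch.randn(2, 8192))
```

Comparing magnitudes at several window sizes trades off time and frequency resolution, which is why this family of losses correlates better with perceived audio quality than waveform MSE.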
KL Divergence Term
Regularizes the latent space to be close to a standard Gaussian. For a diagonal-Gaussian encoder it has the closed form:

D_KL(q_φ(z|x) ‖ N(0, I)) = ½ Σ_i (μ_i² + σ_i² − log σ_i² − 1)
This term:
- Prevents the encoder from using arbitrarily different regions for different inputs
- Creates a smooth, interpolable latent space
- Enables sampling by drawing z ~ N(0, I) and decoding
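The closed-form KL term above is a one-liner in PyTorch; this sketch sums over latent dimensions and averages over the batch:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)): 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1).mean()

# At mu = 0, log_var = 0 the posterior equals the prior, so the KL is zero
kl_zero = kl_to_standard_normal(torch.zeros(3, 8), torch.zeros(3, 8))
kl_shifted = kl_to_standard_normal(torch.ones(3, 8), torch.zeros(3, 8))
```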
The KL–Reconstruction Trade-off
KL Collapse
If the KL weight is too high, the model ignores the latent variable:

q_φ(z|x) ≈ p(z) for all x

Result: the decoder generates from the unconditional prior, and the latent codes carry no information.
KL Annealing
Gradually increase the KL weight during training:

L = L_rec + β(t) · D_KL

where β(t) increases from 0 to 1 over training. This lets the model first learn good reconstructions, then regularize the latent space.
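A linear annealing schedule is the simplest choice; this sketch ramps β from 0 to 1 over a warmup period (the step count is illustrative):

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: beta ramps from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

# In the training loop: total_loss = recon_loss + kl_weight(step) * kl_loss
```

Cyclical schedules (repeatedly resetting β to 0) are a common alternative when a single ramp still collapses.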
β-VAE
Use a fixed β to control the trade-off:

L = L_rec + β · D_KL

- β > 1: more disentangled latent space, potentially worse reconstruction
- β < 1: better reconstruction, less regular latent space
VQ-VAE: Vector Quantized VAE
Replace the continuous Gaussian bottleneck with discrete codes: each encoder output is snapped to its nearest codebook entry:

z_q = e_k,  k = argmin_j ‖z_e(x) − e_j‖₂
Advantages for Audio
- Discrete tokens are compatible with autoregressive transformers
- No KL collapse — codebook utilization replaces KL regularization
- Hierarchical VQ-VAE enables multi-resolution generation (Jukebox)
VQ-VAE Training Loss

L = L_rec + ‖sg[z_e(x)] − e‖² + β ‖z_e(x) − sg[e]‖²

where sg[·] is the stop-gradient operator: the middle term pulls codebook entries toward encoder outputs, and the commitment term keeps encoder outputs close to their codes. Straight-through estimator: gradients pass through the quantization step to the encoder as if it were the identity.
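A sketch of nearest-neighbor quantization with the straight-through estimator in PyTorch (flattened latents and a toy codebook, for illustration):

```python
import torch

def quantize(z_e, codebook):
    """Nearest-codebook lookup with a straight-through gradient estimator.

    z_e: (batch, dim) encoder outputs; codebook: (K, dim) embedding vectors.
    """
    dists = torch.cdist(z_e, codebook)   # (batch, K) pairwise distances
    indices = dists.argmin(dim=-1)       # nearest code per latent vector
    z_q = codebook[indices]
    # Straight-through: forward pass uses z_q, backward copies gradients to z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

codebook = torch.eye(4)                                       # 4 toy codes
z_e = torch.tensor([[0.9, 0.1, 0.0, 0.0]], requires_grad=True)
z_q, indices = quantize(z_e, codebook)
z_q.sum().backward()  # gradient reaches z_e despite the non-differentiable argmin
```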
Audio-Specific VAE Variants
Convolutional VAE for Spectrograms
- Encoder: 2D Conv → BatchNorm → ReLU → downsample
- Latent: flattened feature map → μ, σ
- Decoder: upsample → 2D TransposeConv → output spectrogram
WaveVAE
- Encoder: 1D dilated convolutions (WaveNet-style)
- Decoder: autoregressive waveform generation conditioned on z
- Very high quality, but sampling is extremely slow (one decoder pass per audio sample)
VAE + GAN (VAE-GAN)
Combine VAE reconstruction with adversarial training:
This produces sharper outputs than a pure VAE while maintaining a structured latent space. Modern neural codecs (EnCodec, DAC) follow this recipe, though they use vector quantization rather than Gaussian latents.
VAEs as Compression in Latent Diffusion
In latent diffusion models (Stable Audio), the VAE serves as a compression stage:
- Training phase 1: Train VAE to compress audio ↔ latent with high fidelity
- Training phase 2: Train diffusion model in the VAE's latent space
- Inference: Denoise in latent space → decode with VAE decoder
The VAE provides:
- 8–64× temporal compression
- Smooth, continuous latent space suitable for diffusion
- High-quality reconstruction at decoding time
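The inference path above can be sketched as follows (both callables are stand-ins: `denoise_step` for one reverse-diffusion update and `vae_decode` for the trained VAE decoder; shapes and step count are illustrative):

```python
import torch

def latent_diffusion_generate(denoise_step, vae_decode,
                              latent_shape=(1, 64, 128), n_steps=50):
    """Sketch of latent-diffusion inference: denoise in latent space, decode once."""
    z = torch.randn(latent_shape)        # start from prior noise in latent space
    for t in reversed(range(n_steps)):
        z = denoise_step(z, t)           # one reverse-diffusion step (stand-in)
    return vae_decode(z)                 # single VAE decoder pass back to audio

# Toy stand-ins: a shrinking map for denoising, identity for the decoder
audio = latent_diffusion_generate(lambda z, t: 0.98 * z, lambda z: z)
```

The key point is that the expensive iterative loop runs entirely in the compressed latent space; the VAE decoder is applied exactly once at the end.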
Latent Space Properties
Interpolation
Linear interpolation between two latent codes:

z(α) = (1 − α) z₁ + α z₂,  α ∈ [0, 1]
In a well-trained audio VAE, this produces a smooth morph between two sounds.
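A sketch of the interpolation path in PyTorch (decode each row to hear the morph; the step count is arbitrary):

```python
import torch

def interpolate_latents(z1, z2, steps=5):
    """Stack z(a) = (1 - a) * z1 + a * z2 for evenly spaced a in [0, 1]."""
    alphas = torch.linspace(0.0, 1.0, steps)
    return torch.stack([(1 - a) * z1 + a * z2 for a in alphas])

zs = interpolate_latents(torch.zeros(8), torch.ones(8))  # endpoints are z1 and z2
```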
Arithmetic
Latent space may support semantic operations via vector arithmetic, e.g.

z_target ≈ z_source + (z_attribute_present − z_attribute_absent)
This works to varying degrees depending on the disentanglement of the latent space.
Sampling
Generate new audio by sampling from the prior:

z ~ N(0, I),  x̂ = Decoder_θ(z)
Pure VAE samples tend to be blurry/smooth — this is why diffusion or autoregressive models are typically used for the generation step, with the VAE handling compression only.
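As a sketch, prior sampling is just noise plus one decoder call (`decoder` here is any stand-in module or function mapping latents to audio; the identity is used below only to keep the example self-contained):

```python
import torch

def sample_from_prior(decoder, latent_dim=64, n=4):
    """Draw z ~ N(0, I) and decode it to audio."""
    z = torch.randn(n, latent_dim)
    return decoder(z)

# With an identity stand-in for the decoder, this simply returns the sampled latents
x = sample_from_prior(lambda z: z)
```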