# U-Net for Audio
The U-Net architecture, originally designed for biomedical image segmentation, has become the standard denoiser backbone in diffusion-based audio and music generation systems. Its skip connections and multi-scale processing make it well suited to the iterative denoising process.
## Architecture Overview
The U-Net is an encoder-decoder with skip connections at each resolution:
```
Input (noisy latent)
      │
      ▼
┌───────────┐
│ DownBlock │─────────────────────────────┐ skip
└─────┬─────┘                             │
      ▼                                   │
┌───────────┐                             │
│ DownBlock │───────────────┐ skip        │
└─────┬─────┘               │             │
      ▼                     │             │
┌───────────┐               │             │
│ DownBlock │─────┐ skip    │             │
└─────┬─────┘     │         │             │
      ▼           │         │             │
┌───────────┐     │         │             │
│ Mid Block │     │         │             │
└─────┬─────┘     │         │             │
      ▼           │         │             │
┌───────────┐     │         │             │
│  UpBlock  │◄────┘         │             │
└─────┬─────┘               │             │
      ▼                     │             │
┌───────────┐               │             │
│  UpBlock  │◄──────────────┘             │
└─────┬─────┘                             │
      ▼                                   │
┌───────────┐                             │
│  UpBlock  │◄────────────────────────────┘
└─────┬─────┘
      ▼
Output (predicted noise ε or clean signal)
```
## 1D vs. 2D U-Net for Audio
### 2D U-Net (Spectrogram Domain)
Treats the spectrogram as an image:
- Input: `[batch, channels, frequency, time]`
- Uses 2D convolutions
- Natural for spectrogram diffusion
### 1D U-Net (Latent / Waveform Domain)
Operates on temporal sequences:
- Input: `[batch, channels, time]`
- Uses 1D convolutions
- Used in latent diffusion (Stable Audio) and waveform diffusion
Most modern music generation systems use 1D U-Net on latent representations.
## Building Blocks
### Residual Block
The basic unit at each resolution is a residual block: two convolutions with normalization and a skip connection. Timestep conditioning is injected between the convolutions, typically as a per-channel bias derived from the timestep embedding.
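A minimal PyTorch sketch of such a block (the module name and layer sizes are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Residual block with additive timestep conditioning (illustrative sketch)."""
    def __init__(self, channels: int, emb_dim: int, groups: int = 8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Project the timestep embedding to a per-channel bias.
        self.emb_proj = nn.Linear(emb_dim, channels)
        self.norm2 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.act(self.norm1(x)))
        # Inject timestep information: [B, emb_dim] -> [B, C] -> [B, C, 1]
        h = h + self.emb_proj(self.act(t_emb))[:, :, None]
        h = self.conv2(self.act(self.norm2(h)))
        return x + h  # residual connection
```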
### Timestep Embedding
The diffusion timestep is encoded as a continuous embedding using sinusoidal encoding, the same scheme as transformer positional encoding.
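One common formulation, following the DDPM convention (`dim` assumed even; the function name is illustrative):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps: [B] -> [B, dim]."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                    # [B, half]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # [B, dim]
```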
This embedding is added to or modulates intermediate features, telling the network what noise level to expect.
### Downsampling
Temporal resolution is reduced with a strided convolution (or pooling).
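For example, a stride-2 convolution halves the time axis while doubling the channels (shapes here are illustrative):

```python
import torch
import torch.nn as nn

# Strided convolution halves the temporal resolution; average pooling
# followed by a 1x1 conv is a common alternative.
down = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

x = torch.randn(2, 64, 256)   # [batch, channels, time]
y = down(x)
print(y.shape)                # torch.Size([2, 128, 128])
```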
Each downsampling step doubles the channel dimension to maintain information capacity.
### Upsampling
Temporal resolution is increased either by interpolation followed by a convolution, or by a learned transposed convolution.
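Both options in PyTorch (shapes illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 128, 128)  # [batch, channels, time]

# Option 1: nearest-neighbor interpolation followed by a convolution
up_conv = nn.Conv1d(128, 64, kernel_size=3, padding=1)
y1 = up_conv(F.interpolate(x, scale_factor=2, mode="nearest"))

# Option 2: transposed convolution (learned upsampling)
up_tconv = nn.ConvTranspose1d(128, 64, kernel_size=4, stride=2, padding=1)
y2 = up_tconv(x)

print(y1.shape, y2.shape)  # both torch.Size([2, 64, 256])
```

Interpolation plus convolution is often preferred over transposed convolution because it avoids checkerboard artifacts.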
### Skip Connections
The defining feature of U-Net. Features from the encoder are concatenated with decoder features at matching resolutions.
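In code, this is a concatenation along the channel dimension (shapes illustrative):

```python
import torch

# Decoder feature and the saved encoder feature at the same resolution
decoder_feat = torch.randn(2, 64, 256)   # [batch, channels, time]
skip_feat = torch.randn(2, 64, 256)

# Concatenate along channels; the next conv layer therefore
# takes 64 + 64 = 128 input channels.
h = torch.cat([decoder_feat, skip_feat], dim=1)
print(h.shape)  # torch.Size([2, 128, 256])
```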
Skip connections preserve high-resolution details that would otherwise be lost through the bottleneck. In audio, this means preserving transient detail, fine harmonic structure, and precise timing.
## Attention Layers in U-Net
Modern audio U-Nets include attention layers at certain resolutions.
### Self-Attention
Applied at lower resolutions (where sequence length is manageable), it captures long-range dependencies that convolutions miss, which is important for musical structure and repetition.
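A sketch of a self-attention layer over the time axis of a `[batch, channels, time]` feature map (module name illustrative):

```python
import torch
import torch.nn as nn

class SelfAttention1d(nn.Module):
    """Residual self-attention over the time axis of [B, C, T] features (sketch)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x).transpose(1, 2)   # [B, T, C] for attention
        h, _ = self.attn(h, h, h)          # every timestep attends to every other
        return x + h.transpose(1, 2)       # residual connection
```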
### Cross-Attention (Conditioning)
Injects text or other conditioning signals: the queries come from the U-Net features `x`, while the keys and values come from the conditioning signal `c` (text embeddings from T5, CLAP, etc.), i.e. `Q = x·W_Q`, `K = c·W_K`, `V = c·W_V`.
This is how the U-Net learns to follow text prompts during denoising.
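A sketch of this mechanism (module name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossAttention1d(nn.Module):
    """Cross-attention: queries from U-Net features, keys/values from
    conditioning tokens (e.g. text embeddings). Illustrative sketch."""
    def __init__(self, channels: int, cond_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(
            channels, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T] features; cond: [B, N, cond_dim] conditioning tokens
        h = self.norm(x).transpose(1, 2)   # [B, T, C]
        h, _ = self.attn(h, cond, cond)    # Q from features, K/V from cond
        return x + h.transpose(1, 2)
```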
### Attention Resolution
Attention is computationally expensive, so it is typically applied only at the lower resolution levels (after downsampling):
| Resolution | Sequence Length (30s audio) | Attention |
|---|---|---|
| Full | ~1500 | ✗ (too expensive) |
| /2 | ~750 | Sometimes |
| /4 | ~375 | ✓ |
| /8 | ~187 | ✓ |
| Bottleneck | ~94 | ✓ |
## Conditioning Mechanisms
### Adaptive Group Normalization (AdaGN)
Modulates the normalization parameters based on timestep and class:

`AdaGN(h) = γ · GroupNorm(h) + β`

where `γ` and `β` are predicted from the timestep/class embedding.
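A sketch (module name illustrative; the `1 + gamma` parameterization keeps the layer near identity at initialization):

```python
import torch
import torch.nn as nn

class AdaGroupNorm(nn.Module):
    """GroupNorm whose scale and shift are predicted from an embedding (sketch)."""
    def __init__(self, channels: int, emb_dim: int, groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # predicts gamma and beta

    def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.proj(emb).chunk(2, dim=-1)            # each [B, C]
        return self.norm(x) * (1 + gamma[:, :, None]) + beta[:, :, None]
```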
### FiLM (Feature-wise Linear Modulation)
`FiLM(h) = γ(c) ⊙ h + β(c)`

where `γ` and `β` are learned functions of the conditioning signal `c`.
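The operation itself is a per-channel scale and shift (a minimal sketch, with `gamma` and `beta` assumed to come from a conditioning network):

```python
import torch

def film(features: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """FiLM: per-channel scale and shift of [B, C, T] features by [B, C] params."""
    return gamma[:, :, None] * features + beta[:, :, None]
```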
### Concatenative Conditioning
Concatenate conditioning features with the input along the channel dimension.
Simple but effective for spatial or temporal conditioning.
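For time-aligned conditioning, this is a single concatenation (shapes illustrative):

```python
import torch

# Conditioning features aligned with the input along time
# (e.g. a chroma or melody representation)
x = torch.randn(2, 64, 256)      # noisy latent: [batch, channels, time]
cond = torch.randn(2, 12, 256)   # conditioning: [batch, cond_channels, time]

# Concatenate along channels; the U-Net's first conv layer then
# takes 64 + 12 = 76 input channels.
h = torch.cat([x, cond], dim=1)
print(h.shape)  # torch.Size([2, 76, 256])
```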
## U-Net Configurations for Music
### Typical Architecture Parameters
| Parameter | Small | Medium | Large |
|---|---|---|---|
| Base channels | 128 | 256 | 384 |
| Channel multipliers | [1,2,4,8] | [1,2,3,5] | [1,2,4,4,8] |
| Res blocks per level | 2 | 2 | 3 |
| Attention resolutions | [/4, /8] | [/4, /8] | [/2, /4, /8] |
| Attention heads | 4 | 8 | 16 |
| Parameters | ~50M | ~400M | ~900M |
## Training Considerations
- GroupNorm is standard (BatchNorm causes issues with small batches and diffusion timesteps)
- SiLU (Swish) activation: `silu(x) = x * sigmoid(x)`
- Gradient checkpointing saves memory at the cost of speed
- Flash Attention in attention layers for efficiency
## DiT: Diffusion Transformer (Alternative to U-Net)
Recent work replaces the U-Net with a pure transformer operating on a sequence of latent patches.
DiT uses Adaptive Layer Normalization (AdaLN) instead of cross-attention for conditioning.
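A sketch of a DiT-style block with AdaLN conditioning (module name illustrative; the DiT paper's "adaLN-Zero" variant additionally initializes the gates to zero):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block conditioned via adaptive LayerNorm (sketch)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Predict scale, shift, and gate for both sublayers from the embedding.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, T, dim] tokens; cond: [B, dim] timestep/class embedding
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1[:, None]) + b1[:, None]
        x = x + g1[:, None] * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + s2[:, None]) + b2[:, None]
        return x + g2[:, None] * self.mlp(h)
```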
### U-Net vs. DiT for Audio
| Aspect | U-Net | DiT |
|---|---|---|
| Inductive bias | Multi-scale (built-in) | Minimal (learned) |
| Long-range modeling | Attention + skip | Full attention |
| Scaling | Moderate | Scales well with compute |
| Training efficiency | Good | Higher data requirements |
| Current status | Dominant | Emerging |
Most production music generation systems still use U-Net, but DiT is gaining traction for its simplicity and scaling properties.