U-Net for Audio

The U-Net architecture β€” originally designed for biomedical image segmentation β€” has become the standard denoiser backbone in diffusion-based audio and music generation systems. Its skip connections and multi-scale processing make it well-suited for the iterative denoising process.

Architecture Overview​

The U-Net is an encoder-decoder with skip connections at each resolution:

```
Input (noisy latent)
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ DownBlock │──────────────────────────────┐ skip
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                              β”‚
      β–Ό                                    β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚ DownBlock │────────────────────┐ skip    β”‚
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                    β”‚         β”‚
      β–Ό                          β”‚         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚         β”‚
β”‚ DownBlock │──────┐ skip        β”‚         β”‚
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β”‚             β”‚         β”‚
      β–Ό            β”‚             β”‚         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚             β”‚         β”‚
β”‚ Mid Block β”‚      β”‚             β”‚         β”‚
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜      β”‚             β”‚         β”‚
      β–Ό            β”‚             β”‚         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚             β”‚         β”‚
β”‚  UpBlock  β”‚β—€β”€β”€β”€β”€β”˜             β”‚         β”‚
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                    β”‚         β”‚
      β–Ό                          β”‚         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚         β”‚
β”‚  UpBlock  β”‚β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                              β”‚
      β–Ό                                    β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  UpBlock  β”‚β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
      β–Ό
Output (predicted noise Ξ΅ or clean signal)
```

1D vs. 2D U-Net for Audio​

2D U-Net (Spectrogram Domain)​

Treats the spectrogram as an image:

  • Input: (B, C, F, T) β€” batch, channels, frequency, time
  • Uses 2D convolutions
  • Natural for spectrogram diffusion

1D U-Net (Latent / Waveform Domain)​

Operates on temporal sequences:

  • Input: (B, C, T) β€” batch, channels, time
  • Uses 1D convolutions
  • Used in latent diffusion (Stable Audio) and waveform diffusion

Most modern music generation systems use 1D U-Net on latent representations.
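The shape difference between the two variants can be made concrete with a small sketch (PyTorch is an assumption here; the channel counts, 80 frequency bins, and sequence lengths are illustrative, not taken from any particular system):

```python
import torch
import torch.nn as nn

# 2D path: spectrogram treated as an image, shape (B, C, F, T).
spec = torch.randn(2, 1, 80, 256)          # batch=2, 1 channel, 80 freq bins, 256 frames
conv2d = nn.Conv2d(1, 64, kernel_size=3, padding=1)
print(conv2d(spec).shape)                  # (2, 64, 80, 256)

# 1D path: latent sequence, shape (B, C, T).
latent = torch.randn(2, 64, 1024)          # batch=2, 64 latent channels, 1024 steps
conv1d = nn.Conv1d(64, 128, kernel_size=3, padding=1)
print(conv1d(latent).shape)                # (2, 128, 1024)
```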

Building Blocks​

Residual Block​

The basic unit at each resolution:

h = x + \text{Conv}(\text{SiLU}(\text{GroupNorm}(\text{Conv}(\text{SiLU}(\text{GroupNorm}(x))))))

With timestep conditioning:

h = x + \text{Conv}(\text{SiLU}(\text{GroupNorm}(\text{Conv}(\text{SiLU}(\text{GroupNorm}(x) + \text{MLP}(t_{\text{emb}}))))))
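A minimal PyTorch sketch of this block (PyTorch and all dimensions are assumptions for illustration; the timestep embedding is injected as a per-channel bias after the first normalization, matching the conditioned equation above):

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """h = x + Conv(SiLU(GN(Conv(SiLU(GN(x) + MLP(t_emb))))))."""
    def __init__(self, channels, temb_dim, groups=8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
        self.temb_proj = nn.Linear(temb_dim, channels)  # the MLP(t_emb) term
        self.act = nn.SiLU()

    def forward(self, x, temb):
        # Inject the timestep embedding as a per-channel bias, broadcast over time.
        h = self.norm1(x) + self.temb_proj(temb)[:, :, None]
        h = self.conv1(self.act(h))
        h = self.conv2(self.act(self.norm2(h)))
        return x + h
```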

Timestep Embedding​

The diffusion timestep t is encoded as a continuous embedding:

t_{\text{emb}} = \text{MLP}(\text{sinusoidal}(t))

Sinusoidal encoding (same as transformer positional encoding):

\gamma_i(t) = \begin{cases} \sin(t / 10000^{2i/d}) & \text{even } i \\ \cos(t / 10000^{2i/d}) & \text{odd } i \end{cases}

This embedding is added to or modulates intermediate features, telling the network what noise level to expect.
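A small sketch of the sinusoidal encoding (PyTorch is an assumption; this uses the common "first half sines, second half cosines" layout rather than strict even/odd interleaving, which is equivalent up to a permutation of dimensions):

```python
import math
import torch

def sinusoidal_embedding(t, dim=128, max_period=10000.0):
    """Encode timesteps t of shape (B,) as (B, dim) sin/cos features."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```

At t = 0 the sine half is all zeros and the cosine half is all ones, so every noise level gets a distinct, smoothly varying code.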

Downsampling​

Reduce temporal resolution:

\text{Down}(x) = \text{Conv1d}(x, \text{stride}=2) \quad \text{or} \quad \text{AvgPool1d}(x, 2)

Each downsampling step doubles the channel dimension to maintain information capacity.
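The strided-convolution variant, combined with the channel doubling described above, looks like this in a PyTorch sketch (framework and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Strided conv halves the time axis and doubles the channel count.
down = nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1)
x = torch.randn(2, 64, 1024)
y = down(x)
print(y.shape)   # (2, 128, 512): half the time steps, twice the channels
```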

Upsampling​

Increase temporal resolution:

\text{Up}(x) = \text{Conv1d}(\text{Interpolate}(x, \text{scale}=2))

or transposed convolution:

\text{Up}(x) = \text{ConvTranspose1d}(x, \text{stride}=2)
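Both variants can be sketched in PyTorch (an assumed framework; kernel sizes are illustrative choices that keep the output length exactly doubled):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 128, 512)

# Variant 1: nearest-neighbor interpolation followed by a convolution.
up_conv = nn.Conv1d(128, 64, kernel_size=3, padding=1)
y1 = up_conv(F.interpolate(x, scale_factor=2, mode="nearest"))

# Variant 2: transposed convolution with stride 2.
up_tconv = nn.ConvTranspose1d(128, 64, kernel_size=4, stride=2, padding=1)
y2 = up_tconv(x)

print(y1.shape, y2.shape)   # both (2, 64, 1024)
```

The interpolate-then-conv variant is often preferred in practice because transposed convolutions can introduce checkerboard-style artifacts.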

Skip Connections​

The defining feature of U-Net. Features from the encoder are concatenated with decoder features at matching resolutions:

h_{\text{up}} = \text{UpBlock}([\text{skip}, h_{\text{prev}}])

Skip connections preserve high-resolution details that would otherwise be lost through the bottleneck. In audio, this means preserving transient detail, fine harmonic structure, and precise timing.
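Concretely, the concatenation happens along the channel axis, so the first layer of each UpBlock sees twice the channel count (a PyTorch sketch under assumed dimensions):

```python
import torch
import torch.nn as nn

# Decoder input = concat(skip, upsampled features) along channels,
# then a conv that maps the doubled channel count back down.
skip = torch.randn(2, 64, 1024)       # encoder features at this resolution
h_prev = torch.randn(2, 64, 1024)     # upsampled decoder features
up_block = nn.Conv1d(64 + 64, 64, kernel_size=3, padding=1)
h_up = up_block(torch.cat([skip, h_prev], dim=1))
print(h_up.shape)   # (2, 64, 1024)
```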

Attention Layers in U-Net​

Modern audio U-Nets include attention layers at certain resolutions:

Self-Attention​

Applied at lower resolutions (where sequence length is manageable):

\text{SelfAttn}(x) = \text{Attention}(xW^Q, xW^K, xW^V)

Captures long-range dependencies that convolutions miss β€” important for musical structure and repetition.
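Inside a convolutional U-Net this means flattening the (B, C, T) feature map into a sequence before attending (a PyTorch sketch; the bottleneck length of 94 matches the table below only as an illustrative assumption):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
x = torch.randn(2, 128, 94)        # (B, C, T) features at a low resolution
seq = x.transpose(1, 2)            # (B, T, C): attention expects channels last
out, _ = attn(seq, seq, seq)       # Q = K = V = x, i.e. self-attention
out = out.transpose(1, 2)          # back to (B, C, T)
print(out.shape)
```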

Cross-Attention (Conditioning)​

Injects text or other conditioning signals:

\text{CrossAttn}(x, c) = \text{Attention}(xW^Q, cW^K, cW^V)

where c is the conditioning signal (text embeddings from T5, CLAP, etc.).

This is how the U-Net learns to follow text prompts during denoising.
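A sketch of the query-from-audio, key/value-from-text pattern (PyTorch is an assumption; the text dimension 768 and 77-token length are illustrative, loosely echoing common text-encoder outputs):

```python
import torch
import torch.nn as nn

# Queries come from audio features; keys/values from the conditioning sequence.
attn = nn.MultiheadAttention(embed_dim=128, num_heads=8,
                             kdim=768, vdim=768, batch_first=True)
x = torch.randn(2, 94, 128)      # (B, T_audio, C) audio features
c = torch.randn(2, 77, 768)      # (B, T_text, D) text embeddings
out, weights = attn(query=x, key=c, value=c)
print(out.shape)      # (2, 94, 128): same shape as the audio features
print(weights.shape)  # (2, 94, 77): one attention row per audio position
```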

Attention Resolution​

Attention is computationally expensive, so it is typically applied only at the lower resolution levels (after downsampling):

| Resolution | Sequence length (30 s audio) | Attention |
|------------|------------------------------|-----------|
| Full       | ~1500                        | βœ— (too expensive) |
| /2         | ~750                         | Sometimes |
| /4         | ~375                         | βœ“ |
| /8         | ~187                         | βœ“ |
| Bottleneck | ~94                          | βœ“ |

Conditioning Mechanisms​

Adaptive Group Normalization (AdaGN)​

Modulate normalization parameters based on timestep and class:

\text{AdaGN}(h, y) = y_s \cdot \text{GroupNorm}(h) + y_b

where y_s and y_b are predicted from the timestep/class embedding.

FiLM (Feature-wise Linear Modulation)​

\text{FiLM}(h, c) = \gamma(c) \odot h + \beta(c)

where \gamma and \beta are learned functions of the conditioning signal.
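A minimal FiLM sketch (PyTorch assumed; a single linear layer predicting both Ξ³ and Ξ² is one common choice, not the only one, and AdaGN follows the same pattern with a GroupNorm applied first):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """FiLM(h, c) = gamma(c) * h + beta(c), with gamma/beta from one linear layer."""
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, h, c):
        gamma, beta = self.proj(c).chunk(2, dim=-1)      # each (B, channels)
        return gamma[:, :, None] * h + beta[:, :, None]  # broadcast over time
```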

Concatenative Conditioning​

Concatenate conditioning features with input:

h_{\text{input}} = [x; c_{\text{proj}}]

Simple but effective for spatial or temporal conditioning.

U-Net Configurations for Music​

Typical Architecture Parameters​

| Parameter             | Small     | Medium    | Large        |
|-----------------------|-----------|-----------|--------------|
| Base channels         | 128       | 256       | 384          |
| Channel multipliers   | [1,2,4,8] | [1,2,3,5] | [1,2,4,4,8]  |
| Res blocks per level  | 2         | 2         | 3            |
| Attention resolutions | [/4, /8]  | [/4, /8]  | [/2, /4, /8] |
| Attention heads       | 4         | 8         | 16           |
| Parameters            | ~50M      | ~400M     | ~900M        |

Training Considerations​

  • GroupNorm is standard (BatchNorm causes issues with small batches and diffusion timesteps)
  • SiLU (Swish) activation: SiLU(x) = x Β· Οƒ(x)
  • Gradient checkpointing saves memory at the cost of speed
  • Flash Attention in attention layers for efficiency

DiT: Diffusion Transformer (Alternative to U-Net)​

Recent work replaces the U-Net with a pure transformer:

h_{l+1} = h_l + \text{Attn}(\text{AdaLN}(h_l, c)) + \text{FFN}(\text{AdaLN}(h_l', c))

DiT uses Adaptive Layer Normalization (AdaLN) instead of cross-attention for conditioning.
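A simplified DiT-style block sketch (PyTorch assumed; this version predicts only scale and shift for each LayerNorm and omits the per-branch gating that the original DiT also uses):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with AdaLN conditioning: c modulates both LayerNorms."""
    def __init__(self, dim, heads, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(cond_dim, 4 * dim)  # scale/shift for both norms

    def forward(self, h, c):
        s1, b1, s2, b2 = self.ada(c).chunk(4, dim=-1)
        x = self.norm1(h) * (1 + s1[:, None]) + b1[:, None]   # AdaLN(h, c)
        h = h + self.attn(x, x, x)[0]
        x = self.norm2(h) * (1 + s2[:, None]) + b2[:, None]
        return h + self.ffn(x)
```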

U-Net vs. DiT for Audio​

| Aspect              | U-Net                  | DiT                      |
|---------------------|------------------------|--------------------------|
| Inductive bias      | Multi-scale (built-in) | Minimal (learned)        |
| Long-range modeling | Attention + skip       | Full attention           |
| Scaling             | Moderate               | Scales well with compute |
| Training efficiency | Good                   | Higher data requirements |
| Current status      | Dominant               | Emerging                 |

Most production music generation systems still use U-Net, but DiT is gaining traction for its simplicity and scaling properties.