GAN Architectures for Audio
Generative Adversarial Networks (GANs) are primarily used in audio AI as vocoders — converting mel spectrograms or latent representations to high-fidelity waveforms. They are the final stage in many music generation pipelines.
GAN Fundamentals
A GAN consists of two networks trained adversarially:
- Generator : creates fake audio from input features
- Discriminator : distinguishes real audio from generated audio
Minimax Objective
In audio vocoders, is typically a mel spectrogram or latent code rather than random noise.
Least-Squares GAN (LSGAN)
Most audio GANs use least-squares loss for stability:
where is the conditioning input (mel spectrogram).
HiFi-GAN
HiFi-GAN (2020) is the most widely used neural vocoder. It achieves both high quality and fast inference.
Generator Architecture
The generator uses transposed convolutions for upsampling, with Multi-Receptive Field Fusion (MRF) blocks:
Mel Spectrogram
│
▼
ConvTranspose1d (upsample by 8)
│ ──▶ MRF Block (kernels: 3, 7, 11)
▼
ConvTranspose1d (upsample by 8)
│ ──▶ MRF Block
▼
ConvTranspose1d (upsample by 2)
│ ──▶ MRF Block
▼
ConvTranspose1d (upsample by 2)
│ ──▶ MRF Block
▼
Conv1d ──▶ tanh ──▶ Waveform
MRF blocks apply parallel residual blocks with different kernel sizes, then sum outputs. This captures patterns at multiple temporal scales simultaneously.
The total upsampling factor matches the hop size used in mel spectrogram extraction (e.g., ).
Multi-Period Discriminator (MPD)
Reshapes 1D audio into 2D patterns at different periods:
Each sub-discriminator uses 2D convolutions on a different period .
This captures periodic patterns at different scales — essential for perceiving pitch and harmonic structure.
Multi-Scale Discriminator (MSD)
Operates on audio at different resolutions:
- Original resolution
- 2× downsampled
- 4× downsampled
Each sub-discriminator uses 1D grouped convolutions with increasing dilation.
Training Losses
Feature matching loss:
Mel spectrogram loss:
Typical weights: , .
BigVGAN
BigVGAN (2023) scales up HiFi-GAN with several improvements:
Key Changes
- Anti-aliased activation: Applies low-pass filtering after nonlinearities to reduce aliasing artifacts
- Snake activation: Periodic activation function that captures harmonic structure:
- Larger model: More channels and layers
- Multi-resolution STFT discriminator: Additional discriminator using complex STFT
Performance
BigVGAN achieves state-of-the-art vocoder quality, especially for:
- Out-of-distribution inputs
- High-frequency content
- Unseen speakers and instruments
MelGAN
MelGAN (2019) was one of the first GAN-based vocoders:
- No feature matching loss — pure adversarial training
- Window-based discriminator
- Faster than WaveNet but lower quality than HiFi-GAN
- Historically important but largely superseded
Multi-Band MelGAN
Splits generation into frequency sub-bands:
Each sub-generator handles a different frequency range. The synthesis filter bank combines outputs. Faster inference due to reduced temporal resolution per band.
UnivNet
Combines the best ideas from HiFi-GAN and MelGAN:
- Location-Variable Convolutions (LVC): spatially adaptive kernels
- Multi-Resolution Spectrogram Discriminator: operates on multiple STFT resolutions
- Competitive quality with fast inference
Vocos
A recent alternative that generates STFT magnitudes and phases directly:
Advantages:
- Very fast (no temporal upsampling convolutions)
- Directly produces complex spectrogram
- Good quality at extremely low computational cost
Audio GAN Training Tips
Discriminator Balance
The discriminator should not overpower the generator:
| Symptom | Cause | Fix |
|---|---|---|
| Mode collapse | D too strong | Reduce D learning rate |
| Noisy output | D too weak | Increase D capacity |
| Training oscillation | Imbalanced | Use gradient penalty, spectral normalization |
Gradient Penalty
Regularize discriminator gradients:
Spectral Normalization
Normalize weight matrices by their spectral norm:
where is the largest singular value. Stabilizes discriminator training.
GAN Vocoders in Music AI Pipelines
Text ──▶ [Text Encoder] ──▶ [Diffusion/AR Model] ──▶ Mel Spectrogram ──▶ [GAN Vocoder] ──▶ Waveform
The GAN vocoder is often the last mile of the generation pipeline. Its quality directly impacts perceived output quality, making vocoder selection a critical engineering decision.
| Vocoder | Quality | Speed (RTF) | Best For |
|---|---|---|---|
| HiFi-GAN V1 | Very good | ~80× RT | General purpose |
| BigVGAN | Excellent | ~40× RT | Highest quality |
| Vocos | Good | ~200× RT | Low latency |
| Multi-Band MelGAN | Good | ~150× RT | Mobile/edge |
RTF = Real-Time Factor (how many times faster than real-time).