
GAN Architectures for Audio

Generative Adversarial Networks (GANs) are primarily used in audio AI as vocoders — converting mel spectrograms or latent representations to high-fidelity waveforms. They are the final stage in many music generation pipelines.

GAN Fundamentals

A GAN consists of two networks trained adversarially:

  • Generator G: creates fake audio from input features
  • Discriminator D: distinguishes real audio from generated audio

Minimax Objective

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

In audio vocoders, z is typically a mel spectrogram or latent code rather than random noise.

Least-Squares GAN (LSGAN)

Most audio GANs use least-squares loss for stability:

\mathcal{L}_D = \mathbb{E}[(D(x) - 1)^2] + \mathbb{E}[D(G(s))^2]

\mathcal{L}_G = \mathbb{E}[(D(G(s)) - 1)^2]

where s is the conditioning input (mel spectrogram).
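As a concrete illustration, both least-squares losses can be computed directly from discriminator scores. This is a minimal numpy sketch; the function names are my own, and in practice the scores would come from a neural discriminator:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: push scores on real audio toward 1,
    scores on generated audio toward 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push scores on generated audio toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

# A discriminator that perfectly separates real (1) from fake (0)
# incurs zero loss, while the generator's loss is then maximal:
print(lsgan_d_loss(np.ones(4), np.zeros(4)))  # 0.0
print(lsgan_g_loss(np.zeros(4)))              # 1.0
```

Unlike the log-based minimax objective, these quadratic terms keep gradients non-saturating even when the discriminator is confident, which is why most audio GANs prefer them.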

HiFi-GAN

HiFi-GAN (2020) is one of the most widely used neural vocoders, achieving both high quality and fast inference.

Generator Architecture

The generator uses transposed convolutions for upsampling, with Multi-Receptive Field Fusion (MRF) blocks:

Mel Spectrogram
      │
ConvTranspose1d (upsample by 8) ──▶ MRF Block (kernels: 3, 7, 11)
      │
ConvTranspose1d (upsample by 8) ──▶ MRF Block
      │
ConvTranspose1d (upsample by 2) ──▶ MRF Block
      │
ConvTranspose1d (upsample by 2) ──▶ MRF Block
      │
Conv1d ──▶ tanh ──▶ Waveform

MRF blocks apply parallel residual blocks with different kernel sizes, then sum outputs. This captures patterns at multiple temporal scales simultaneously.

The total upsampling factor matches the hop size used in mel spectrogram extraction (e.g., 8 \times 8 \times 2 \times 2 = 256).
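The rates above are the published HiFi-GAN V1 configuration; a quick numpy check of the arithmetic, with the frame count as an illustrative value:

```python
import numpy as np

# HiFi-GAN V1 upsampling rates; their product must equal the hop size
# used when the input mel spectrogram was extracted.
upsample_rates = [8, 8, 2, 2]
hop_size = 256
assert int(np.prod(upsample_rates)) == hop_size

# A mel spectrogram with T frames therefore maps to T * 256 waveform samples:
n_frames = 32
print(n_frames * hop_size)  # 8192
```

If the product and the hop size disagree, generated waveforms come out the wrong length relative to the conditioning spectrogram, so this invariant is worth asserting in any vocoder config.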

Multi-Period Discriminator (MPD)

Reshapes 1D audio into 2D patterns at different periods:

x_p[t, c] = x[t \cdot p + c], \quad c = 0, \dots, p-1

Each sub-discriminator uses 2D convolutions on a different period p \in \{2, 3, 5, 7, 11\}.

This captures periodic patterns at different scales — essential for perceiving pitch and harmonic structure.
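The reshape itself is a one-liner. A numpy sketch (the function name is my own; the original pads with reflection rather than the zeros used here):

```python
import numpy as np

def reshape_for_period(x, p):
    """Fold a 1D signal into a (frames, period) 2D array so that
    x_p[t, c] = x[t * p + c]. Zero-pads so the length divides p."""
    pad = (-len(x)) % p
    x = np.pad(x, (0, pad))
    return x.reshape(-1, p)

x = np.arange(10, dtype=float)
x3 = reshape_for_period(x, 3)
print(x3.shape)   # (4, 3): length 10 padded to 12, folded at period 3
print(x3[2, 1])   # x[2*3 + 1] = 7.0
```

After this fold, a signal with period p lines up in columns, so ordinary 2D convolutions can detect the periodicity directly.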

Multi-Scale Discriminator (MSD)

Operates on audio at different resolutions:

  • Original resolution
  • 2× downsampled
  • 4× downsampled

Each sub-discriminator uses 1D grouped convolutions with increasing dilation.
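The multi-scale inputs are produced by average pooling. A simplified numpy sketch (the real models use strided average pooling with overlapping windows; this non-overlapping version is my own simplification):

```python
import numpy as np

def avg_pool_downsample(x, factor=2):
    """Downsample a 1D signal by averaging non-overlapping windows."""
    n = len(x) - (len(x) % factor)
    return x[:n].reshape(-1, factor).mean(axis=1)

x = np.arange(8, dtype=float)
scales = [x, avg_pool_downsample(x), avg_pool_downsample(avg_pool_downsample(x))]
print([len(s) for s in scales])  # [8, 4, 2]: the three MSD resolutions
```

Each sub-discriminator then sees the same audio with progressively less high-frequency detail, forcing the generator to be consistent across timescales.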

Training Losses

\mathcal{L}_G = \mathcal{L}_{\text{adv}}(G) + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}}(G) + \lambda_{\text{mel}} \mathcal{L}_{\text{mel}}(G)

Feature matching loss:

\mathcal{L}_{\text{feat}} = \sum_{k}\sum_{l} \frac{1}{N_{k,l}}\left\|D_k^{(l)}(x) - D_k^{(l)}(G(s))\right\|_1

Mel spectrogram loss:

\mathcal{L}_{\text{mel}} = \|M(x) - M(G(s))\|_1

Typical weights: \lambda_{\text{feat}} = 2, \lambda_{\text{mel}} = 45.
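Putting the three terms together, a minimal numpy sketch of the combined generator loss (function names are my own; means stand in for the 1/N_{k,l} normalization, and in a real trainer all inputs come from the networks):

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between discriminator feature maps, one array
    per layer, summed over layers."""
    return sum(np.mean(np.abs(fr - ff)) for fr, ff in zip(feats_real, feats_fake))

def generator_loss(d_fake, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_feat=2.0, lambda_mel=45.0):
    adv = np.mean((d_fake - 1.0) ** 2)                    # LSGAN adversarial term
    feat = feature_matching_loss(feats_real, feats_fake)  # feature matching
    mel = np.mean(np.abs(mel_real - mel_fake))            # mel L1
    return adv + lambda_feat * feat + lambda_mel * mel

# With a perfect generator (identical features and mels, D fooled to 1),
# every term vanishes:
feats = [np.ones((4, 8))]
mel = np.ones((80, 16))
print(generator_loss(np.ones(3), feats, feats, mel, mel))  # 0.0
```

The large mel weight means early training is dominated by spectrogram reconstruction, with the adversarial and feature-matching terms sharpening fine detail once the coarse structure is right.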

BigVGAN

BigVGAN (2023) scales up HiFi-GAN with several improvements:

Key Changes

  1. Anti-aliased activation: applies low-pass filtering after nonlinearities to reduce aliasing artifacts:
\text{AMP}(x) = \text{LPF}(\text{snake}(x))
  2. Snake activation: a periodic activation function that captures harmonic structure:
\text{snake}(x) = x + \frac{1}{\alpha}\sin^2(\alpha x)
  3. Larger model: more channels and layers
  4. Multi-resolution STFT discriminator: an additional discriminator operating on complex STFTs
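The snake function is simple enough to write down directly. A numpy sketch (in BigVGAN, \alpha is a learned per-channel parameter; here it is a plain argument):

```python
import numpy as np

def snake(x, alpha=1.0):
    """snake(x) = x + sin^2(alpha * x) / alpha.
    Reduces to the identity wherever sin(alpha * x) = 0, and adds a
    periodic ripple elsewhere, biasing the network toward harmonic signals."""
    return x + np.sin(alpha * x) ** 2 / alpha

print(snake(0.0))  # 0.0: identity at the origin
# Larger alpha gives a higher-frequency, lower-amplitude ripple:
x = np.linspace(-2, 2, 5)
print(snake(x, alpha=4.0) - x)  # the periodic component alone
```

Because the ripple is periodic in x, repeated snake layers can compose oscillations at related frequencies, which is the intuition behind its fit for harmonic audio.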

Performance

BigVGAN achieves state-of-the-art vocoder quality, especially for:

  • Out-of-distribution inputs
  • High-frequency content
  • Unseen speakers and instruments

MelGAN

MelGAN (2019) was one of the first GAN-based vocoders:

  • Adversarial plus feature-matching losses, but no spectrogram reconstruction loss
  • Window-based discriminator
  • Faster than WaveNet but lower quality than HiFi-GAN
  • Historically important but largely superseded

Multi-Band MelGAN

Splits generation into frequency sub-bands:

x = \sum_{b=1}^{B} \text{upsample}(G_b(s))

Each sub-generator handles a different frequency range, and a synthesis (PQMF) filter bank combines the outputs. Inference is faster because each band is generated at reduced temporal resolution.

UnivNet

Combines the best ideas from HiFi-GAN and MelGAN:

  • Location-Variable Convolutions (LVC): convolution kernels predicted from the conditioning input, varying over time
  • Multi-Resolution Spectrogram Discriminator: operates on multiple STFT resolutions
  • Competitive quality with fast inference

Vocos

A recent alternative that generates STFT magnitudes and phases directly:

(|\hat{X}|, \angle\hat{X}) = G(s)

\hat{x} = \text{iSTFT}(|\hat{X}| \cdot e^{j\angle\hat{X}})

Advantages:

  • Very fast (no temporal upsampling convolutions)
  • Directly produces complex spectrogram
  • Good quality at extremely low computational cost
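The magnitude-and-phase recombination step can be illustrated on a single frame with numpy's real FFT; a full vocoder applies this per STFT frame with overlap-add (the iSTFT), so treat this as a one-frame stand-in:

```python
import numpy as np

# One frame of audio and its spectrum:
frame = np.random.default_rng(0).normal(size=256)
X = np.fft.rfft(frame)

# A Vocos-style head predicts magnitude and phase separately;
# here we just split the true spectrum to show the recombination.
mag, phase = np.abs(X), np.angle(X)
X_hat = mag * np.exp(1j * phase)            # |X| * e^{j angle(X)}
frame_hat = np.fft.irfft(X_hat, n=len(frame))

print(np.allclose(frame, frame_hat))        # True: exact reconstruction
```

Because the inverse transform is a fixed linear operation, all of the model's capacity goes into predicting the spectrum, and no learned temporal upsampling is needed, which is where the speed advantage comes from.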

Audio GAN Training Tips

Discriminator Balance

The discriminator should not overpower the generator:

Symptom               Cause         Fix
Mode collapse         D too strong  Reduce D learning rate
Noisy output          D too weak    Increase D capacity
Training oscillation  Imbalanced    Gradient penalty, spectral normalization

Gradient Penalty

Regularize discriminator gradients:

\mathcal{L}_{\text{GP}} = \lambda\, \mathbb{E}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]
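The penalty itself is easy to compute once the gradients are available. This numpy sketch assumes the per-sample gradients of D's output with respect to its inputs have already been obtained via autograd in a real framework; the function name and \lambda = 10 default are my own:

```python
import numpy as np

def gradient_penalty(grads, lam=10.0):
    """Penalize deviation of per-sample gradient norms from 1.
    `grads` has shape (batch, n_samples): gradient of D's score with
    respect to each (interpolated) input sample."""
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

# Unit-norm gradients incur zero penalty:
g = np.zeros((4, 16))
g[:, 0] = 1.0
print(gradient_penalty(g))  # 0.0
```

Keeping gradient norms near 1 bounds how sharply D can change between real and fake examples, which damps the oscillations listed in the table above.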

Spectral Normalization

Normalize weight matrices by their spectral norm:

W_{\text{SN}} = \frac{W}{\sigma(W)}

where \sigma(W) is the largest singular value. This stabilizes discriminator training.
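A numpy sketch of the normalization (here \sigma is computed exactly via SVD for clarity; frameworks such as torch.nn.utils.spectral_norm approximate it cheaply with power iteration at each training step):

```python
import numpy as np

def spectral_normalize(W):
    """Divide W by its largest singular value so that sigma(W_SN) = 1."""
    sigma = np.linalg.svd(W, compute_uv=False)[0]
    return W / sigma

W = np.random.default_rng(1).normal(size=(8, 8))
W_sn = spectral_normalize(W)
# The normalized matrix has unit spectral norm (up to float error):
print(np.linalg.svd(W_sn, compute_uv=False)[0])
```

Since \sigma(W) is the Lipschitz constant of the linear map x \mapsto Wx, capping it at 1 for every discriminator layer bounds how fast D's output can change with its input.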

GAN Vocoders in Music AI Pipelines

Text ──▶ [Text Encoder] ──▶ [Diffusion/AR Model] ──▶ Mel Spectrogram ──▶ [GAN Vocoder] ──▶ Waveform

The GAN vocoder is often the last mile of the generation pipeline. Its quality directly impacts perceived output quality, making vocoder selection a critical engineering decision.

Vocoder            Quality    Speed (RTF)  Best For
HiFi-GAN V1        Very good  ~80× RT      General purpose
BigVGAN            Excellent  ~40× RT      Highest quality
Vocos              Good       ~200× RT     Low latency
Multi-Band MelGAN  Good       ~150× RT     Mobile/edge

RTF = Real-Time Factor, here quoted as a real-time multiple (~80× RT means audio is generated 80 times faster than real time). Figures are indicative and depend on hardware.