Loss Functions for Audio Generation
Loss design determines what a model improves during training, so it directly shapes musical realism, stability, and controllability.
Reconstruction Losses
Given a predicted spectrogram $S_{\text{pred}}$ and a target $S_{\text{target}}$:

$$\mathcal{L}_{L1} = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F} \left| S_{\text{pred}}(t,f) - S_{\text{target}}(t,f) \right|$$

$$\mathcal{L}_{L2} = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F} \left( S_{\text{pred}}(t,f) - S_{\text{target}}(t,f) \right)^2$$
In practice, L1 tends to preserve sharp transients better, while L2 penalizes large errors more heavily and can over-smooth the reconstruction.
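The two losses above can be sketched in a few lines of NumPy; the function names and toy spectrogram shapes here are illustrative, not from any particular library.

```python
import numpy as np

def l1_loss(s_pred, s_target):
    # Mean absolute error over all time-frequency bins (1/(T*F) * sum |diff|).
    return np.mean(np.abs(s_pred - s_target))

def l2_loss(s_pred, s_target):
    # Mean squared error over all time-frequency bins (1/(T*F) * sum diff^2).
    return np.mean((s_pred - s_target) ** 2)

# Toy spectrograms: T=4 frames, F=3 frequency bins, constant error of 0.5.
s_target = np.zeros((4, 3))
s_pred = np.full((4, 3), 0.5)

print(l1_loss(s_pred, s_target))  # 0.5
print(l2_loss(s_pred, s_target))  # 0.25
```

Note how the same uniform error of 0.5 yields a smaller L2 value; the relationship inverts for errors above 1, which is why L2 reacts so strongly to outlier bins.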
Adversarial Objectives
For generator G and discriminator D:
$$\mathcal{L}_{\text{adv}} = \mathbb{E}\left[\log D(x_{\text{real}})\right] + \mathbb{E}\left[\log\left(1 - D(G(z))\right)\right]$$

The discriminator maximizes this objective while the generator minimizes it.
In audio, multi-period and multi-scale discriminators help capture both micro-timbre and long-range rhythmic structure.
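As a minimal sketch of the value function above (assuming the discriminator outputs probabilities in (0, 1); the function name is illustrative):

```python
import numpy as np

def adv_loss(d_real, d_fake, eps=1e-7):
    # GAN value function: E[log D(x_real)] + E[log(1 - D(G(z)))].
    # d_real, d_fake: arrays of discriminator probabilities in (0, 1).
    d_real = np.clip(d_real, eps, 1 - eps)  # guard against log(0)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# A discriminator that is confident on real (D -> 1) and fake (D -> 0)
# pushes the value toward its maximum of 0.
print(adv_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2])))
```

A multi-scale setup would evaluate this same objective on several discriminators, each seeing the waveform at a different resolution, and sum the terms.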
Perceptual / Feature Matching Losses
$$\mathcal{L}_{\text{perc}} = \sum_{l=1}^{L} \lambda_l \left\| \phi_l(S_{\text{pred}}) - \phi_l(S_{\text{target}}) \right\|_2^2$$
Feature matching reduces metallic artifacts and improves subjective quality compared with pure reconstruction loss.
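A sketch of the weighted per-layer sum, assuming the feature maps $\phi_l$ have already been extracted (e.g. discriminator activations); all names here are hypothetical.

```python
import numpy as np

def feature_matching_loss(feats_pred, feats_target, weights):
    # Weighted sum over layers l of lambda_l * ||phi_l(pred) - phi_l(target)||_2^2.
    # feats_pred / feats_target: lists of per-layer feature arrays.
    return sum(w * np.sum((fp - ft) ** 2)
               for w, fp, ft in zip(weights, feats_pred, feats_target))

# Two hypothetical feature layers with a small constant perturbation.
rng = np.random.default_rng(0)
feats_target = [rng.normal(size=(8,)), rng.normal(size=(4,))]
feats_pred = [f + 0.1 for f in feats_target]
print(feature_matching_loss(feats_pred, feats_target, weights=[1.0, 0.5]))
```

The weights $\lambda_l$ let early layers (fine timbre) and deep layers (structure) contribute on comparable scales.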
Diffusion Noise-Prediction Loss
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$$

with

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
This trains the denoiser to recover clean signal structure from different noise levels.
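The forward noising step and the noise-prediction loss can be written directly from the equations above; this is a minimal NumPy sketch with illustrative names, not a full training loop.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, eps):
    # Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def diffusion_loss(eps, eps_pred):
    # Noise-prediction objective: MSE between true and predicted noise.
    return np.mean((eps - eps_pred) ** 2)

rng = np.random.default_rng(0)
x0 = rng.normal(size=16)      # clean signal (e.g. latent or waveform frame)
eps = rng.normal(size=16)     # sampled Gaussian noise
xt = forward_diffuse(x0, alpha_bar_t=0.7, eps=eps)

# A perfect denoiser would predict eps exactly, giving zero loss.
print(diffusion_loss(eps, eps))  # 0.0
```

In training, `eps_pred` would come from the network $\epsilon_\theta(x_t, t)$ evaluated at a randomly sampled timestep $t$.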
KL Regularization in Latent Models
$$D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right)$$
KL terms prevent latent collapse and support stable sampling at inference time.
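The closed form above (KL between a diagonal-Gaussian posterior and a standard-normal prior) is a one-liner; parameterizing by log-variance, as is common for numerical stability, it looks like this sketch:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ):
    # 0.5 * sum_i (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2).
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# When mu = 0 and sigma = 1 the posterior equals the prior, so KL is zero.
print(kl_diag_gaussian(np.zeros(4), np.zeros(4)))  # 0.0
```

As the posterior mean drifts from 0 or its variance from 1, the term grows, which is exactly the pressure that keeps the latent space well matched to the sampling prior.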