
Evaluation Metrics for AI Music

Evaluating generated music is challenging because quality is multidimensional and inherently subjective. This page covers the objective metrics, perceptual scores, and human evaluation methods used in the field.

Objective Audio Metrics

Fréchet Audio Distance (FAD)

FAD is the most widely used metric for generative audio quality. It measures the distributional distance between generated and real audio embeddings:

$$\text{FAD} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{tr}\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\left(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g\right)^{1/2}\right)$$

where $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ are the mean and covariance of embeddings from real and generated audio, respectively.

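Lower FAD is better: it means the generated embedding distribution is closer to the real one. Scores depend on the embedding model and the reference set, so they are comparable only when both are held fixed.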

Embedding model choices:

  • VGGish: original, widely used but dated
  • CLAP: more recent, captures text-audio alignment
  • MERT: music-specific embeddings
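As a rough illustration of the FAD formula above, here is a minimal NumPy sketch. It assumes the embeddings have already been extracted by one of the models listed (the extraction step is not shown), and the function name `frechet_distance` is hypothetical. The trace of the matrix square root is computed from the eigenvalues of the covariance product, which is one of several possible approaches.

```python
import numpy as np

def frechet_distance(emb_real, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_real, emb_gen: arrays of shape (n_samples, dim), with dim >= 2.
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    sigma_r = np.cov(emb_real, rowvar=False)
    sigma_g = np.cov(emb_gen, rowvar=False)
    # tr((Sigma_r Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g. These are real and non-negative in
    # exact arithmetic, so numerical noise is dropped and clipped here.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

With identical inputs the result should be close to zero, and shifting every embedding by a constant vector should add roughly the squared length of that vector.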

Fréchet Inception Distance (FID)

FID is the same Fréchet distance applied to spectrogram images, with a vision model (e.g., Inception) as the embedding extractor. It is less common than FAD for audio but still appears in some papers.

Inception Score (IS)

$$\text{IS} = \exp\left(\mathbb{E}_x \left[ D_{\text{KL}}(p(y|x) \| p(y)) \right]\right)$$

IS rewards both quality (confident per-sample class predictions) and diversity (a near-uniform marginal class distribution). Higher is better. It was originally proposed for images and is adapted to audio by substituting an audio classifier.
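A small sketch of the Inception Score computation, assuming the classifier posteriors p(y|x) are already available as a matrix with one row per generated clip. The function name and the `eps` smoothing term are illustrative choices, not part of the definition.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_samples, n_classes), each row a class posterior p(y|x)."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    # Per-sample KL divergence between p(y|x) and p(y).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Under this sketch, perfectly confident predictions spread evenly over K classes give a score near K, while identical predictions for every sample give a score near 1.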

Kernel Inception Distance (KID)

$$\text{KID} = \text{MMD}^2(\{f(x_r)\}, \{f(x_g)\})$$

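KID is the squared Maximum Mean Discrepancy (MMD) between features of real and generated samples. Unlike FID/FAD, it has an unbiased estimator, which gives it better statistical properties at small sample sizes. Lower is better.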
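The unbiased MMD estimate can be sketched as follows. This assumes the cubic polynomial kernel commonly paired with KID and pre-extracted feature matrices; the function names are hypothetical, and reported KID values are often averaged over several random subsets, which is omitted here.

```python
import numpy as np

def polynomial_kernel(a, b, degree=3, coef0=1.0):
    """k(x, y) = (x . y / d + coef0) ** degree, with d the feature dimension."""
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree

def kid(feat_real, feat_gen):
    """Unbiased squared MMD between two feature sets of shape (n, dim)."""
    m, n = len(feat_real), len(feat_gen)
    k_rr = polynomial_kernel(feat_real, feat_real)
    k_gg = polynomial_kernel(feat_gen, feat_gen)
    k_rg = polynomial_kernel(feat_real, feat_gen)
    # The unbiased estimator excludes the diagonal (self-similarity) terms.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```

Because the estimator is unbiased, it can come out slightly negative when the two sets are drawn from the same distribution.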

Spectral Metrics

Multi-Resolution STFT Loss

Used both for training and evaluation:

$$\mathcal{L}_{\text{MRSTFT}} = \frac{1}{M}\sum_{m=1}^{M}\left(\mathcal{L}_{\text{sc}}^{(m)} + \mathcal{L}_{\text{mag}}^{(m)}\right)$$

where the spectral convergence loss and the log-magnitude loss are computed at each of the $M$ STFT resolutions.

Spectral convergence:

$$\mathcal{L}_{\text{sc}} = \frac{\||\text{STFT}(x)| - |\text{STFT}(\hat{x})|\|_F}{\||\text{STFT}(x)|\|_F}$$
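The two terms can be combined in a short NumPy sketch. Several details here are assumptions rather than anything the text specifies: the three (FFT size, hop) pairs, the Hann window, the mean absolute log-magnitude difference used for the log-magnitude term, and the helper name `stft_mag`.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window; returns shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def mrstft_loss(x, x_hat, resolutions=((512, 128), (1024, 256), (2048, 512)), eps=1e-7):
    """Average of spectral convergence + log-magnitude loss over resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        s, s_hat = stft_mag(x, n_fft, hop), stft_mag(x_hat, n_fft, hop)
        # Spectral convergence: Frobenius norm of the magnitude difference,
        # normalised by the norm of the reference magnitude.
        sc = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + eps)
        # Log-magnitude term: mean absolute difference of log magnitudes.
        mag = np.mean(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
        total += sc + mag
    return float(total / len(resolutions))
```

Both signals need to be at least as long as the largest FFT size for every resolution to produce a frame.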

Log-Spectral Distance (LSD)

$$\text{LSD} = \frac{1}{T}\sum_{t=1}^{T}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(10\log_{10}\frac{|X(t,k)|^2}{|\hat{X}(t,k)|^2}\right)^2}$$

Here the sum runs over $T$ frames and $K$ frequency bins of the reference and estimated STFTs. LSD measures per-frame spectral distortion in dB; lower is better.
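A minimal sketch of LSD, assuming the two magnitude spectrograms have already been computed with identical STFT settings and have the same shape. The `eps` term is an added safeguard against division by zero and is not part of the definition.

```python
import numpy as np

def log_spectral_distance(mag_ref, mag_est, eps=1e-10):
    """mag_ref, mag_est: magnitude spectrograms of shape (T frames, K bins)."""
    mag_ref = np.asarray(mag_ref, dtype=float)
    mag_est = np.asarray(mag_est, dtype=float)
    # Per-bin power ratio in dB.
    log_ratio = 10.0 * np.log10((mag_ref ** 2 + eps) / (mag_est ** 2 + eps))
    # RMS over frequency bins within each frame, then mean over frames.
    return float(np.mean(np.sqrt(np.mean(log_ratio ** 2, axis=1))))
```

As a sanity check, scaling every magnitude by a factor of 10 changes the power by 20 dB in each bin, so the LSD should be about 20.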

Mel Cepstral Distortion (MCD)

$$\text{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{D}(c_i - \hat{c}_i)^2}$$

where $c_i$ and $\hat{c}_i$ are the $i$-th mel cepstral coefficients of the reference and synthesized frames. MCD is widely used in speech synthesis and also applies to singing voice.
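The per-frame formula can be sketched in plain Python. This assumes the reference and synthesized frames are already time-aligned (alignment, for example by dynamic time warping, is not shown) and that the coefficient lists contain only the terms included in the sum; the function names are hypothetical.

```python
import math

def mcd_frame(c, c_hat):
    """MCD in dB for one pair of mel cepstral coefficient vectors."""
    squared_diff = sum((a - b) ** 2 for a, b in zip(c, c_hat))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared_diff)

def mcd(ref_frames, est_frames):
    """Mean per-frame MCD over two aligned sequences of coefficient vectors."""
    per_frame = [mcd_frame(c, ch) for c, ch in zip(ref_frames, est_frames)]
    return sum(per_frame) / len(per_frame)
```

Averaging over frames is a common reporting convention and is an assumption here, since the formula itself is defined for a single frame.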

Text-Audio Alignment Metrics

CLAP Score

Using a pre-trained CLAP (Contrastive Language-Audio Pretraining) model:

$$\text{CLAP Score} = \text{cos\_sim}(\mathbf{e}_{\text{text}}, \mathbf{e}_{\text{audio}})$$

Measures how well the generated audio matches the text prompt. Higher is better.
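Once the text and audio embeddings exist, the score itself is a cosine similarity. The sketch below assumes the embeddings were produced elsewhere by a CLAP model (no model API is shown), and averaging over prompt/clip pairs is an assumed reporting convention; the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clap_score(text_embs, audio_embs):
    """Mean cosine similarity over paired (text, audio) embeddings."""
    scores = [cosine_similarity(t, a) for t, a in zip(text_embs, audio_embs)]
    return sum(scores) / len(scores)
```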

Text-Audio Relevance

Text-audio relevance can also be computed with other joint-embedding models:

  • ImageBind (multimodal alignment)
  • MuLan (music-language model)

Musical Attribute Metrics

Tempo Accuracy

Compare detected BPM of generated audio vs. target:

$$\text{Tempo Error} = |BPM_{\text{detected}} - BPM_{\text{target}}|$$

Key Accuracy

Percentage of generated clips where the detected musical key matches the prompted or target key.
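Both the tempo and key checks reduce to simple comparisons once a detector has produced its estimates. The sketch below assumes BPM values and key labels come from an external analysis tool that is not shown, and that keys are compared as exact strings, so enharmonic spellings would need normalising first. The function names are hypothetical.

```python
def tempo_error(bpm_detected, bpm_target):
    """Absolute tempo error in BPM."""
    return abs(bpm_detected - bpm_target)

def key_accuracy(detected_keys, target_keys):
    """Percentage of clips whose detected key label equals the target label."""
    matches = sum(d == t for d, t in zip(detected_keys, target_keys))
    return 100.0 * matches / len(target_keys)
```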

Pitch Quality

  • F0 RMSE: root mean square error of fundamental frequency trajectory
  • Voicing Decision Error (VDE): fraction of frames whose voiced/unvoiced decision disagrees with the reference
  • Gross Pitch Error (GPE): percentage of voiced frames with a pitch error greater than 50 cents
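The three pitch metrics can be computed together from two frame-aligned F0 tracks. This sketch assumes F0 is given in Hz with 0 marking unvoiced frames, and that F0 RMSE and GPE are evaluated only on frames voiced in both tracks; both conventions are assumptions, as is the function name.

```python
import math

def pitch_metrics(f0_ref, f0_est, cents_threshold=50.0):
    """F0 RMSE (Hz), voicing decision error, and gross pitch error.

    f0_ref, f0_est: equal-length F0 tracks in Hz, with 0 for unvoiced frames.
    """
    pairs = list(zip(f0_ref, f0_est))
    both_voiced = [(r, e) for r, e in pairs if r > 0 and e > 0]
    if not both_voiced:
        raise ValueError("no frames are voiced in both tracks")
    # Frames where one track is voiced and the other is not.
    voicing_errors = sum((r > 0) != (e > 0) for r, e in pairs)
    rmse = math.sqrt(sum((r - e) ** 2 for r, e in both_voiced) / len(both_voiced))
    # Pitch deviation in cents: 1200 * log2(f_est / f_ref).
    gross = sum(abs(1200.0 * math.log2(e / r)) > cents_threshold for r, e in both_voiced)
    return {
        "f0_rmse_hz": rmse,
        "vde": voicing_errors / len(pairs),
        "gpe": gross / len(both_voiced),
    }
```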

Rhythm Metrics

  • Beat F1: harmonic mean of precision and recall of detected beat positions vs. reference, within a timing tolerance
  • Downbeat F1: the same measure applied to downbeats (measure-level timing)
  • Groove consistency: autocorrelation analysis of onset patterns
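Beat F1 can be sketched as a one-to-one matching of estimated and reference beat times. The ±70 ms tolerance used here is a common convention in beat-tracking evaluation but is an assumption for this page, and the function name is hypothetical. The same function applies to downbeats.

```python
def beat_f1(ref_beats, est_beats, tolerance=0.07):
    """F1 of estimated beat times (seconds) against reference beat times."""
    ref, est = sorted(ref_beats), sorted(est_beats)
    matched = 0
    i = j = 0
    # Two-pointer sweep: each reference beat matches at most one estimate.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate too early to match any remaining reference beat
        else:
            i += 1  # reference beat has no estimate within tolerance
    if matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2.0 * precision * recall / (precision + recall)
```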

Perceptual Quality Scores

PESQ (Perceptual Evaluation of Speech Quality)

$$\text{PESQ} \in [-0.5, 4.5]$$

Designed for speech; sometimes repurposed for vocal evaluation. Higher is better.

ViSQOL (Virtual Speech Quality Objective Listener)

Perceptual quality estimator using spectro-temporal comparison:

$$\text{ViSQOL} \in [1.0, 5.0]$$

More robust than PESQ for music content.

SI-SDR (Scale-Invariant Signal-to-Distortion Ratio)

$$s_{\text{proj}} = \frac{\langle\hat{s}, s\rangle}{\|s\|^2}\, s$$

$$\text{SI-SDR} = 10\log_{10} \frac{\|s_{\text{proj}}\|^2}{\|\hat{s} - s_{\text{proj}}\|^2}$$

Used primarily for source separation quality. Higher is better.
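The SI-SDR computation follows the projection directly: the estimate is projected onto the reference signal, and the projected energy is compared with the residual energy. This plain-Python sketch assumes equal-length signals; the `eps` term is an added numerical safeguard, and mean removal, which some implementations apply first, is omitted.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)  # optimal scaling of the reference
    s_proj = [alpha * r for r in reference]
    residual = [e - p for e, p in zip(estimate, s_proj)]
    proj_energy = sum(p * p for p in s_proj)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((proj_energy + eps) / (residual_energy + eps))
```

Rescaling the estimate by any non-zero constant leaves the result essentially unchanged, which is the scale invariance the name refers to.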

Human Evaluation

Mean Opinion Score (MOS)

Listeners rate audio samples on a 1–5 scale:

| Score | Quality |
|-------|-----------|
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |

MOS is the gold standard but expensive and slow. Design guidelines:

  • Use at least 20 listeners
  • Randomize presentation order
  • Include anchor samples (real music, known-bad examples)
  • Report confidence intervals
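For the last guideline, a MOS with an approximate 95% confidence interval can be sketched as below. It uses a normal approximation (z = 1.96) and treats all ratings as independent, which is a simplification when the same listener rates several samples; the function name is hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of the approximate 95% confidence interval)."""
    mean = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem
```

For example, `mos_with_ci([3, 4, 5, 4, 4])` gives a MOS of 4 with a half-width of roughly 0.62, which would be reported as 4.0 ± 0.62.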

AB Preference Testing

Present two samples (A and B) and ask which is preferred. This is simpler than MOS and captures relative quality directly.
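One way to judge whether an observed preference differs from chance is an exact two-sided binomial test against a 50/50 split. This is a suggested analysis rather than something the test design prescribes; it assumes independent trials with ties excluded, and the function name is hypothetical.

```python
import math

def ab_preference_p_value(wins_a, n_trials):
    """Two-sided exact binomial p-value for H0: A and B are equally preferred."""
    k = max(wins_a, n_trials - wins_a)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n_trials, i) for i in range(k, n_trials + 1)) / 2 ** n_trials
    # Under p = 0.5 the distribution is symmetric, so double the tail.
    return min(1.0, 2.0 * tail)
```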

MUSHRA (Multi-Stimulus with Hidden Reference and Anchor)

  • Present multiple versions simultaneously
  • Include a hidden reference (real audio)
  • Include a low-quality anchor
  • Listeners rate each on 0–100 scale
  • Good for comparing multiple systems

Attribute Rating

Rate specific dimensions independently:

  • Audio quality: production fidelity, absence of artifacts
  • Musicality: harmonic coherence, melodic quality
  • Text adherence: how well the audio matches the prompt
  • Creativity / interestingness: novelty and engagement
  • Structure: presence of coherent arrangement

Evaluation Best Practices

| Practice | Reason |
|----------|--------|
| Report multiple metrics | No single metric captures everything |
| Always include human evaluation | Objective metrics can diverge from perception |
| Use large evaluation sets | Small sets have high variance |
| Compare on the same test set | Ensure fair comparisons |
| Report confidence intervals | Quantify uncertainty |
| Disclose evaluation conditions | Sample rate, duration, number of listeners |

Metric Correlation Summary

| Metric | Correlates With | Limitations |
|--------|-----------------|-------------|
| FAD | Overall distributional quality | Doesn't capture per-sample issues |
| CLAP Score | Text-audio relevance | Bounded by CLAP model quality |
| MOS | Perceived quality | Expensive, subjective variance |
| LSD | Spectral accuracy | Doesn't capture temporal coherence |
| Tempo/Key accuracy | Musical correctness | Narrow attributes only |