Evaluation Metrics for AI Music
Evaluating generated music is challenging because quality is multidimensional and inherently subjective. This page covers the objective metrics, perceptual scores, and human evaluation methods used in the field.
Objective Audio Metrics
Fréchet Audio Distance (FAD)
FAD is the most widely used metric for generative audio quality. It measures the distributional distance between embeddings of real and generated audio:

$$\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where $\mu_r, \mu_g$ and $\Sigma_r, \Sigma_g$ are the means and covariances of embeddings from real and generated audio respectively.

Lower FAD = better quality.
Embedding model choices:
- VGGish: original, widely used but dated
- CLAP: more recent, captures text-audio alignment
- MERT: music-specific embeddings
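Given embedding matrices from any of these models, FAD reduces to a few lines of linear algebra. A minimal NumPy/SciPy sketch (the function name and shapes are illustrative; the embeddings themselves would come from VGGish, CLAP, MERT, etc.):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Frechet distance between two sets of audio embeddings.

    real_emb, gen_emb: arrays of shape (n_samples, embedding_dim),
    e.g. VGGish (dim 128) or CLAP (dim 512) embeddings.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the covariance product
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Note that FAD compares whole sets, not individual clips, so it needs a reasonably large batch of generated samples to be stable.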
Fréchet Inception Distance (FID)
FID applied to spectrogram images, using a vision model (e.g., Inception) as the embedding extractor. Less common than FAD for audio but still used in some papers.
Inception Score (IS)
Measures both quality (confident class predictions) and diversity (uniform marginal distribution). Originally for images; adapted for audio with audio classifiers.
Kernel Inception Distance (KID)
Unbiased alternative to FID/FAD, uses Maximum Mean Discrepancy. Better statistical properties with small sample sizes.
Spectral Metrics
Multi-Resolution STFT Loss
Used both for training and evaluation:

$$\mathcal{L}_{\text{MR-STFT}} = \frac{1}{M}\sum_{m=1}^{M}\left(\mathcal{L}_{sc}^{(m)} + \mathcal{L}_{mag}^{(m)}\right)$$

where the spectral convergence loss $\mathcal{L}_{sc}^{(m)}$ and log-magnitude loss $\mathcal{L}_{mag}^{(m)}$ are computed at $M$ different STFT resolutions.

Spectral convergence:

$$\mathcal{L}_{sc} = \frac{\left\lVert\, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \,\right\rVert_F}{\left\lVert\, |\mathrm{STFT}(x)| \,\right\rVert_F}$$
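The combined loss can be sketched with `scipy.signal.stft`, pairing spectral convergence with an L1 log-magnitude term (a common pairing; the exact FFT sizes and overlaps vary by paper):

```python
import numpy as np
from scipy.signal import stft

def multi_res_stft_loss(x: np.ndarray, x_hat: np.ndarray,
                        fft_sizes=(512, 1024, 2048)) -> float:
    """Average of spectral-convergence and log-magnitude L1 losses
    over several STFT resolutions."""
    total = 0.0
    eps = 1e-8
    for n_fft in fft_sizes:
        _, _, X = stft(x, nperseg=n_fft, noverlap=n_fft * 3 // 4)
        _, _, X_hat = stft(x_hat, nperseg=n_fft, noverlap=n_fft * 3 // 4)
        mag, mag_hat = np.abs(X), np.abs(X_hat)
        # Spectral convergence: Frobenius-norm relative magnitude error
        sc = np.linalg.norm(mag - mag_hat) / (np.linalg.norm(mag) + eps)
        # L1 distance between log-magnitude spectrograms
        log_mag = np.mean(np.abs(np.log(mag + eps) - np.log(mag_hat + eps)))
        total += sc + log_mag
    return total / len(fft_sizes)
```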
Log-Spectral Distance (LSD)
Measures per-frame spectral distortion in dB. Lower is better.
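A common formulation (assuming magnitude spectra $S$ and $\hat{S}$ over $T$ frames and $K$ frequency bins; some papers use power spectra or a $20\log_{10}$ convention instead) is:

$$\mathrm{LSD} = \frac{1}{T}\sum_{t=1}^{T}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(10\log_{10}\frac{S(t,k)}{\hat{S}(t,k)}\right)^{2}}$$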
Mel Cepstral Distortion (MCD)
$$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_d - \hat{c}_d\right)^2}$$

where $c_d$ and $\hat{c}_d$ are the mel cepstral coefficients of the reference and generated audio. Widely used in speech synthesis; applicable to singing voice.
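Given already-aligned mel cepstral coefficient frames, MCD is straightforward to compute. A minimal sketch (frame alignment, often done with DTW, and exclusion of the 0th energy coefficient are assumed to have happened upstream):

```python
import numpy as np

def mel_cepstral_distortion(mcc_ref: np.ndarray, mcc_gen: np.ndarray) -> float:
    """MCD in dB between time-aligned mel cepstral coefficient frames.

    mcc_ref, mcc_gen: arrays of shape (n_frames, n_coeffs), with the
    0th (energy) coefficient assumed already excluded.
    """
    diff = mcc_ref - mcc_gen
    # Per-frame distortion, then averaged over frames
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```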
Text-Audio Alignment Metrics
CLAP Score
Using a pre-trained CLAP (Contrastive Language-Audio Pretraining) model:

$$\text{CLAP score} = \cos\!\left(E_{\text{text}}(t),\, E_{\text{audio}}(\hat{x})\right)$$

Measures how well the generated audio matches the text prompt, via cosine similarity between the text and audio embeddings. Higher is better.
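Once the embeddings are extracted, the score itself is just a cosine similarity, averaged over the evaluation set. A sketch (the embeddings would come from a pretrained CLAP model such as the `laion_clap` package; the extraction calls are not shown and any method names you use should be checked against that library's docs):

```python
import numpy as np

def clap_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between a CLAP text embedding and audio embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(t @ a)
```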
Text-Audio Relevance
Can also be computed using:
- ImageBind (multimodal alignment)
- MuLan (music-language model)
Musical Attribute Metrics
Tempo Accuracy
Compare the detected BPM of generated audio against the target, counting a clip as correct when the relative error is within a tolerance $\tau$:

$$\text{Tempo Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\frac{\left|\mathrm{BPM}^{(i)}_{\text{det}} - \mathrm{BPM}^{(i)}_{\text{tgt}}\right|}{\mathrm{BPM}^{(i)}_{\text{tgt}}} \le \tau\right]$$
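A sketch of this comparison, assuming detected tempi come from an estimator such as `librosa.beat.tempo` on the generated clips (the 4% tolerance here is an illustrative choice, borrowed from common beat-tracking practice):

```python
import numpy as np

def tempo_accuracy(detected_bpm, target_bpm, rel_tol: float = 0.04) -> float:
    """Fraction of clips whose detected tempo is within rel_tol of the target."""
    detected = np.asarray(detected_bpm, dtype=float)
    target = np.asarray(target_bpm, dtype=float)
    return float(np.mean(np.abs(detected - target) / target <= rel_tol))
```

Octave errors (detecting 60 BPM for a 120 BPM target) are common in tempo estimation; some evaluations also accept half/double-tempo matches.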
Key Accuracy
Percentage of generated clips where the detected musical key matches the prompted or target key.
Pitch Quality
- F0 RMSE: root mean square error of fundamental frequency trajectory
- Voicing Decision Error (VDE): percentage of frames where the voiced/unvoiced decision is incorrect
- Gross Pitch Error (GPE): percentage of frames with >50 cent pitch error
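F0 RMSE and GPE can be computed from two aligned F0 trajectories. A minimal sketch, restricted (as is conventional) to frames both trackers mark as voiced:

```python
import numpy as np

def f0_metrics(f0_ref, f0_gen, cent_threshold: float = 50.0):
    """F0 RMSE (Hz) and Gross Pitch Error over mutually voiced frames.

    f0_ref, f0_gen: per-frame F0 in Hz, with 0 marking unvoiced frames.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_gen = np.asarray(f0_gen, dtype=float)
    voiced = (f0_ref > 0) & (f0_gen > 0)
    ref, gen = f0_ref[voiced], f0_gen[voiced]
    rmse = float(np.sqrt(np.mean((ref - gen) ** 2)))
    cents = 1200.0 * np.abs(np.log2(gen / ref))   # pitch error in cents
    gpe = float(np.mean(cents > cent_threshold))  # fraction of gross errors
    return rmse, gpe
```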
Rhythm Metrics
- Beat F1: harmonic mean of precision and recall of detected beat positions vs. reference, matched within a small tolerance window
- Downbeat F1: accuracy of measure-level timing
- Groove consistency: autocorrelation analysis of onset patterns
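Beat F1 can be sketched as one-to-one matching within a tolerance window; the ±70 ms default mirrors the convention in `mir_eval.beat.f_measure` (which is what most papers actually use):

```python
def beat_f1(ref_beats, det_beats, tol: float = 0.07) -> float:
    """F-measure of detected beat times vs. reference (times in seconds).

    Each reference beat may be matched to at most one detection,
    within a +/- tol window.
    """
    ref = sorted(ref_beats)
    det = sorted(det_beats)
    if not ref or not det:
        return 0.0
    used = [False] * len(det)
    matched = 0
    for r in ref:
        for i, d in enumerate(det):
            if not used[i] and abs(d - r) <= tol:
                used[i] = True
                matched += 1
                break
    precision = matched / len(det)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```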
Perceptual Quality Scores
PESQ (Perceptual Evaluation of Speech Quality)
Designed for speech; sometimes repurposed for vocal evaluation. Higher is better.
ViSQOL (Virtual Speech Quality Objective Listener)
Perceptual quality estimator based on spectro-temporal similarity between a reference and a degraded signal, mapped to a MOS-like score. More robust than PESQ for music content.
SI-SDR (Scale-Invariant Signal-to-Distortion Ratio)
Used primarily for source separation quality. Higher is better.
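SI-SDR has a short closed form: project the estimate onto the reference to find the optimal scale, then measure the energy ratio of the scaled target to the residual. A minimal NumPy sketch:

```python
import numpy as np

def si_sdr(reference, estimate) -> float:
    """Scale-Invariant SDR in dB (higher is better)."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    # Zero-mean both signals, per the standard definition
    ref = ref - ref.mean()
    est = est - est.mean()
    # Optimal scaling of the reference toward the estimate
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref
    noise = est - target
    return float(10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise)))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is the point of the "scale-invariant" formulation.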
Human Evaluation
Mean Opinion Score (MOS)
Listeners rate audio samples on a 1–5 scale:
| Score | Quality |
|---|---|
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |
MOS is the gold standard but expensive and slow. Design guidelines:
- Use at least 20 listeners
- Randomize presentation order
- Include anchor samples (real music, known-bad examples)
- Report confidence intervals
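The last guideline is cheap to implement: report the MOS with a confidence interval rather than a bare mean. A sketch using the normal approximation (a t-based interval would be more appropriate for very few listeners):

```python
import numpy as np

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval.

    ratings: flat list of 1-5 listener scores for one system.
    """
    r = np.asarray(ratings, dtype=float)
    mean = float(r.mean())
    half_width = z * r.std(ddof=1) / np.sqrt(len(r))
    return mean, (mean - half_width, mean + half_width)
```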
AB Preference Testing
Present two samples (A and B) and ask which is preferred. Simpler than MOS, captures relative quality.
MUSHRA (Multi-Stimulus with Hidden Reference and Anchor)
- Present multiple versions simultaneously
- Include a hidden reference (real audio)
- Include a low-quality anchor
- Listeners rate each on 0–100 scale
- Good for comparing multiple systems
Attribute Rating
Rate specific dimensions independently:
- Audio quality: production fidelity, absence of artifacts
- Musicality: harmonic coherence, melodic quality
- Text adherence: how well the audio matches the prompt
- Creativity / interestingness: novelty and engagement
- Structure: presence of coherent arrangement
Evaluation Best Practices
| Practice | Reason |
|---|---|
| Report multiple metrics | No single metric captures everything |
| Always include human evaluation | Objective metrics can diverge from perception |
| Use large evaluation sets | Small sets have high variance |
| Compare on the same test set | Ensure fair comparisons |
| Report confidence intervals | Quantify uncertainty |
| Disclose evaluation conditions | Sample rate, duration, number of listeners |
Metric Correlation Summary
| Metric | Correlates With | Limitations |
|---|---|---|
| FAD | Overall distributional quality | Doesn't capture per-sample issues |
| CLAP Score | Text-audio relevance | Bounded by CLAP model quality |
| MOS | Perceived quality | Expensive, subjective variance |
| LSD | Spectral accuracy | Doesn't capture temporal coherence |
| Tempo/Key accuracy | Musical correctness | Narrow attributes only |