Mel Spectrograms
The mel spectrogram is one of the most widely used audio representations in modern music ML. It combines spectral analysis with perceptual frequency weighting, producing a compact 2D feature that models can process efficiently.
From Linear Spectrogram to Mel
Step 1: STFT
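Compute the Short-Time Fourier Transform (see FFT page):
$$X[t, k] = \sum_{n=0}^{N-1} x[n + tH]\, w[n]\, e^{-j 2\pi k n / N}$$
where $x$ is the audio signal, $w$ is a window of length $N$ (the FFT size), $H$ is the hop size, $t$ is the frame index, and $k$ is the frequency bin.
This produces a complex-valued time-frequency representation with $N/2 + 1$ frequency bins.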
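As a rough illustration of Step 1, the sketch below frames a signal, applies a Hann window, and takes a real FFT per frame with NumPy. The function name `stft`, the choice of window, and the absence of padding are assumptions made for this example; library implementations usually pad and center frames.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Return a complex array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)  # symmetric Hann; a periodic Hann is also common
    n_frames = 1 + (len(x) - n_fft) // hop  # no padding: only full frames are kept
    frames = np.stack([x[m * hop : m * hop + n_fft] for m in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.fft.rfft(frames * window, axis=1)

sr = 22050
time_s = np.arange(sr) / sr                # one second of audio
x = np.sin(2 * np.pi * 440.0 * time_s)     # 440 Hz test tone
X = stft(x)
print(X.shape)                             # (n_frames, n_fft // 2 + 1)

peak_bin = int(np.abs(X[10]).argmax())     # strongest bin in one frame
print(peak_bin * sr / 1024)                # its frequency in Hz, close to 440
```

The strongest bin lands within one bin width (`sr / n_fft`, about 21.5 Hz here) of the tone frequency, which is the frequency resolution limit of this FFT size.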
Step 2: Power Spectrum
Take the squared magnitude of the STFT $X[t, k]$:
$$P[t, k] = |X[t, k]|^2$$
Step 3: Mel Filter Bank
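Apply a bank of triangular filters spaced according to the mel scale:
$$S_{\text{mel}}[t, i] = \sum_{k} W_i[k]\, P[t, k]$$
where $W_i[k]$ is the weight of filter $i$ at frequency bin $k$ and $P[t, k]$ is the power spectrum from Step 2.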
Step 4: Log Compression
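Apply logarithmic compression to match human loudness perception:
$$S_{\log}[t, i] = \log\left(S_{\text{mel}}[t, i] + \epsilon\right)$$
Here $S_{\text{mel}}$ is the filter bank output from Step 3. The small constant $\epsilon$ (values such as $10^{-5}$ or $10^{-10}$ are common) prevents $\log(0)$.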
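Steps 2 to 4 reduce to a few array operations once an STFT and a filter bank matrix are available. The sketch below assumes NumPy, a hypothetical helper name `log_mel_from_stft`, and random placeholder inputs; the matrix `W` here is not a real mel filter bank and only stands in for one so the shapes can be checked.

```python
import numpy as np

def log_mel_from_stft(X, W, eps=1e-10):
    """X: complex STFT, shape (n_frames, n_bins). W: weights, shape (n_mels, n_bins)."""
    power = np.abs(X) ** 2       # Step 2: squared magnitude
    mel = power @ W.T            # Step 3: weighted sum over frequency bins
    return np.log(mel + eps)     # Step 4: log compression; eps avoids log(0)

rng = np.random.default_rng(0)
n_frames, n_bins, n_mels = 5, 513, 80
X = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
W = rng.random((n_mels, n_bins))  # placeholder weights, NOT a real mel filter bank

log_mel = log_mel_from_stft(X, W)
print(log_mel.shape)              # (n_frames, n_mels)

# For digital silence every mel bin is zero, so the output equals log(eps).
silent = log_mel_from_stft(np.zeros((n_frames, n_bins), dtype=complex), W)
print(silent[0, 0], np.log(1e-10))
```

The silent-input case shows why the choice of `eps` matters: it sets the floor of the feature range, so it should be the same wherever the features are computed.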
The Mel Scale
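The mel scale maps frequency $f$ (in Hz) to perceived pitch $m$ (in mels):
$$m = \text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
Inverse:
$$f = 700\left(10^{m/2595} - 1\right)$$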
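A minimal sketch of the two conversions, assuming the common HTK-style formula with constants 2595 and 700. Other variants exist (for example the Slaney formula that librosa uses by default), so the function names and constants here are illustrative rather than universal.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Mels back to frequency in Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    m = hz_to_mel(f)
    print(f"{f:7.1f} Hz -> {m:7.1f} mel -> {mel_to_hz(m):7.1f} Hz")
```

With these constants, 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to progressively larger steps in Hz as frequency rises.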
Mel Filter Bank Design
Triangular filters are placed at equally spaced points on the mel scale:
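- Choose the number of filters $M$ (typically 80 or 128 for music)
- Define frequency range $[f_{\min}, f_{\max}]$ (e.g., [0 Hz, 8000 Hz] or [20 Hz, 16000 Hz])
- Convert limits to mel: $m_{\min} = \text{mel}(f_{\min})$, $m_{\max} = \text{mel}(f_{\max})$
- Space $M + 2$ points equally in the mel domain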
- Convert back to Hz for filter center frequencies
- Build overlapping triangular filters
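Each filter $i$ has center frequency $f_i$ and spans from $f_{i-1}$ to $f_{i+1}$:
$$W_i[k] = \begin{cases} \dfrac{f(k) - f_{i-1}}{f_i - f_{i-1}} & f_{i-1} \le f(k) \le f_i \\ \dfrac{f_{i+1} - f(k)}{f_{i+1} - f_i} & f_i < f(k) \le f_{i+1} \\ 0 & \text{otherwise} \end{cases}$$
where $f(k)$ is the frequency of FFT bin $k$.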
Filters are narrow at low frequencies (fine pitch resolution) and wide at high frequencies (coarse, matching human perception).
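The design procedure above can be sketched in NumPy as follows. This assumes the HTK-style mel formula, unnormalized triangles with peak value 1, and hypothetical names (`mel_filter_bank`, `f_min`, `f_max`); libraries often add area normalization or use a different mel variant, so the values will not match them exactly.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr=22050, n_fft=1024, n_mels=80, f_min=0.0, f_max=8000.0):
    """Return triangular filter weights of shape (n_mels, n_fft // 2 + 1)."""
    # n_mels + 2 points: each filter needs a left edge, a center, and a right edge
    points_mel = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    points_hz = mel_to_hz(points_mel)
    bin_hz = np.arange(n_fft // 2 + 1) * sr / n_fft  # frequency of each FFT bin
    bank = np.zeros((n_mels, len(bin_hz)))
    for i in range(n_mels):
        left, center, right = points_hz[i], points_hz[i + 1], points_hz[i + 2]
        rising = (bin_hz - left) / (center - left)     # ramp up to the center
        falling = (right - bin_hz) / (right - center)  # ramp down after it
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank()
print(bank.shape)
# Low filters cover few FFT bins, high filters cover many.
print(np.count_nonzero(bank[0]), np.count_nonzero(bank[-1]))
```

If the FFT size is small relative to the number of mel bands, the lowest filters can become narrower than one FFT bin and end up empty, which is one reason larger FFT sizes are paired with more mel bands.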
Typical Parameters for Music ML
| Parameter | Typical Value | Notes |
|---|---|---|
| Sample rate | 22050 or 44100 Hz | Higher = more bandwidth |
| FFT size ($N$) | 1024 or 2048 | Frequency resolution |
| Hop size ($H$) | 256 or 512 | Time resolution |
| Mel bands ($M$) | 80 or 128 | Feature dimensionality |
| $f_{\min}$ | 0 or 20 Hz | Low frequency cutoff |
| $f_{\max}$ | Nyquist ($f_s/2$) or 8000 Hz | High frequency cutoff |
| Log type | Natural log or $\log_{10}$ | Scale convention |
| Power | 1 (magnitude) or 2 (power) | Amplitude vs. energy |
Parameter Trade-offs
- More mel bands → finer frequency detail, larger model input
- Larger FFT → better frequency resolution, coarser time resolution
- Smaller hop → finer time resolution, more frames, slower processing
- Higher $f_{\max}$ → captures high harmonics and brightness cues
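The trade-offs above can be made concrete with some quick arithmetic. The helper below is a hypothetical convenience function, and its frame count assumes centered, padded framing (one frame per hop plus one); other framing conventions give slightly different counts.

```python
def spectrogram_stats(sr, n_fft, hop, duration_s):
    """Basic resolution figures for a given STFT configuration."""
    return {
        "freq_resolution_hz": sr / n_fft,         # spacing between FFT bins
        "window_ms": 1000.0 * n_fft / sr,         # time span of one analysis window
        "hop_ms": 1000.0 * hop / sr,              # time step between frames
        "frames": 1 + int(duration_s * sr) // hop,  # assumes centered, padded framing
    }

small = spectrogram_stats(22050, 1024, 256, 10.0)
large = spectrogram_stats(22050, 2048, 512, 10.0)
for name, stats in (("n_fft=1024, hop=256", small), ("n_fft=2048, hop=512", large)):
    print(name, stats)
```

Doubling the FFT size halves the bin spacing but doubles the window duration, and doubling the hop roughly halves the number of frames a model has to process.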
Mel Spectrograms vs. Other Representations
| Representation | Frequency Spacing | Phase Info | Dimensionality | Invertible |
|---|---|---|---|---|
| Complex STFT | Linear | Yes | High | Yes (perfect) |
| Magnitude spectrogram | Linear | No | High | Approximate |
| Mel spectrogram | Perceptual | No | Moderate | Approximate |
| CQT | Logarithmic | Optional | Moderate | Approximate |
| MFCC | Perceptual + DCT | No | Low | No |
Inversion: Mel Spectrogram to Audio
Since the mel spectrogram discards phase and compresses frequency, inversion requires estimation.
Griffin-Lim Algorithm
Iterative phase reconstruction:
- Map the mel spectrogram back to a linear-frequency magnitude by approximately inverting the mel filter bank
- Start with random phase
- Iterate: iSTFT → STFT → keep the resulting phase, restore the target magnitude
Griffin-Lim is fast but produces metallic, artifact-prone audio.
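A compact NumPy sketch of the Griffin-Lim iteration is given below. It starts from a linear-frequency magnitude spectrogram, so the approximate mel inversion step is left out, and the helper names, Hann window, lack of centering, and iteration count are all assumptions made for this example rather than a reference implementation.

```python
import numpy as np

def stft(x, n_fft, hop, window):
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop : m * hop + n_fft] for m in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

def istft(X, n_fft, hop, window):
    """Windowed overlap-add, normalized by the summed squared window."""
    length = n_fft + hop * (X.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(X, n=n_fft, axis=1) * window
    for m in range(X.shape[0]):
        x[m * hop : m * hop + n_fft] += frames[m]
        norm[m * hop : m * hop + n_fft] += window ** 2
    # The floor limits amplification where window coverage is weak (signal edges).
    return x / np.maximum(norm, 1e-3)

def griffin_lim(magnitude, n_fft=512, hop=128, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    window = np.hanning(n_fft)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(magnitude * phase, n_fft, hop, window)  # impose the target magnitude
        X = stft(x, n_fft, hop, window)                   # re-analyze the signal
        phase = np.exp(1j * np.angle(X))                  # keep only the new phase
    return istft(magnitude * phase, n_fft, hop, window)

sr = 8000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
padded = np.pad(tone, 512)  # zero-pad so the edges are fully covered by frames
target = np.abs(stft(padded, 512, 128, np.hanning(512)))
audio = griffin_lim(target)
print(audio.shape)
```

Each iteration projects onto the set of valid STFTs and then back onto the target magnitude, which is why the result is consistent but can still carry the phase artifacts described above.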
Neural Vocoders (Preferred)
Modern systems use trained neural networks for high-quality inversion:
| Vocoder | Architecture | Quality | Speed |
|---|---|---|---|
| WaveNet | Autoregressive | Excellent | Very slow |
| WaveGlow | Flow-based | Very good | Fast |
| HiFi-GAN | GAN-based | Excellent | Very fast |
| BigVGAN | GAN-based | State-of-the-art | Fast |
| Vocos | ISTFT-based | Very good | Very fast |
HiFi-GAN and BigVGAN are among the most common choices in production systems.
Mel Spectrograms in Diffusion Models
Mel spectrograms serve as both training targets and intermediate representations in diffusion-based music generation.
The model learns to denoise mel spectrograms, and a vocoder then converts the clean mel spectrogram to a waveform.
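To make "learning to denoise" more concrete, the sketch below shows how a noisy training input could be produced from a clean mel spectrogram. It assumes a DDPM-style forward process with a linear noise schedule, which the text above does not specify; the function names, schedule values, and the random stand-in for a normalized log-mel spectrogram are all illustrative.

```python
import numpy as np

def make_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def add_noise(mel, step, alpha_bar, rng):
    """Return the noised spectrogram and the noise a model would learn to predict."""
    noise = rng.standard_normal(mel.shape)
    noisy = np.sqrt(alpha_bar[step]) * mel + np.sqrt(1.0 - alpha_bar[step]) * noise
    return noisy, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
mel = rng.standard_normal((80, 100))  # placeholder for a normalized log-mel spectrogram

slightly_noisy, _ = add_noise(mel, 0, alpha_bar, rng)    # almost the clean input
mostly_noise, _ = add_noise(mel, 999, alpha_bar, rng)    # almost pure noise
print(alpha_bar[0], alpha_bar[-1])
```

In a setup like this, the network is trained to recover the noise (or the clean spectrogram) from the noised input at a random step, and generation runs the process in reverse before the vocoder stage.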
Dynamic Range Compression Variants
Beyond simple log compression, other approaches exist:
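Power-Law Compression
$$S_{\text{pl}}[t, i] = S_{\text{mel}}[t, i]^{\alpha}, \quad 0 < \alpha < 1$$
PCEN (Per-Channel Energy Normalization)
$$\text{PCEN}[t, i] = \left(\frac{S_{\text{mel}}[t, i]}{\left(\epsilon + \tilde{S}[t, i]\right)^{\alpha}} + \delta\right)^{r} - \delta^{r}$$
where $S_{\text{mel}}[t, i]$ is the mel spectrogram before log compression and $\tilde{S}[t, i] = (1 - s)\,\tilde{S}[t-1, i] + s\,S_{\text{mel}}[t, i]$ is a version of it smoothed along time.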
PCEN provides automatic gain control and is robust to varying recording conditions, which makes it useful for training on diverse, inconsistently normalized data.
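Both variants are short to express in NumPy. The sketch below assumes the usual PCEN formulation with a first-order smoother along time, and uses the defaults commonly quoted from the original PCEN paper (`s=0.025`, `alpha=0.98`, `delta=2`, `r=0.5`, `eps=1e-6`); the function names and the random test input are illustrative.

```python
import numpy as np

def power_law(mel, alpha=0.3):
    """Power-law compression; alpha < 1 compresses the dynamic range."""
    return mel ** alpha

def pcen(mel, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """mel: non-negative array of shape (n_frames, n_mels)."""
    smooth = np.empty_like(mel, dtype=float)
    smooth[0] = mel[0]
    for t in range(1, len(mel)):
        # first-order low-pass filter along time, applied per mel channel
        smooth[t] = (1.0 - s) * smooth[t - 1] + s * mel[t]
    # divide by the smoothed energy (gain control), then compress
    return (mel / (eps + smooth) ** alpha + delta) ** r - delta ** r

rng = np.random.default_rng(0)
mel = rng.random((50, 8)) + 0.01   # placeholder mel energies

print(power_law(np.array([0.0, 1.0, 16.0]), alpha=0.5))
quiet = pcen(mel)
loud = pcen(100.0 * mel)           # same content, 100 times more energy
print(np.abs(loud - quiet).max())
```

Scaling the input by a factor of 100 changes the PCEN output only slightly, because the division by the smoothed energy cancels most of the gain; the same scaling would shift a log-mel spectrogram by a constant offset instead.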
Implementation Notes
Most frameworks provide mel spectrogram computation:
- torchaudio: `torchaudio.transforms.MelSpectrogram`
- librosa: `librosa.feature.melspectrogram`
- tensorflow: `tf.signal.linear_to_mel_weight_matrix` + STFT
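Use identical mel parameters (sample rate, FFT size, hop size, number of mel bands, frequency range, and log convention) for training and inference. A mismatch, for example between an acoustic model and its vocoder, will produce garbage outputs.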
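One way to guard against such mismatches is to keep all mel parameters in a single immutable configuration object that is saved with the model and compared at load time. The sketch below is a hypothetical pattern, not part of any of the libraries listed above; the class name, field names, and default values are assumptions.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MelConfig:
    """Hypothetical container for every parameter that shapes the mel features."""
    sample_rate: int = 22050
    n_fft: int = 1024
    hop_length: int = 256
    n_mels: int = 80
    f_min: float = 0.0
    f_max: float = 8000.0
    power: float = 2.0
    log_eps: float = 1e-10

def check_compatible(train_cfg, infer_cfg):
    """Raise ValueError listing every field that differs between the two configs."""
    a, b = asdict(train_cfg), asdict(infer_cfg)
    diffs = {key: (a[key], b[key]) for key in a if a[key] != b[key]}
    if diffs:
        raise ValueError(f"mel parameter mismatch: {diffs}")

check_compatible(MelConfig(), MelConfig())  # identical configs pass silently

try:
    check_compatible(MelConfig(), MelConfig(hop_length=512))
except ValueError as err:
    print(err)  # reports the differing hop_length values
```

Storing the configuration next to the model checkpoint means the inference code can rebuild the exact feature pipeline instead of relying on defaults that may differ between libraries or versions.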