
Music Representations

The choice of how to represent music — as waveforms, spectrograms, symbolic tokens, or learned codes — is one of the most impactful engineering decisions in any AI music system. Each representation trades off between fidelity, compactness, interpretability, and compatibility with different model architectures.

Taxonomy of Representations

Music Representations
├── Continuous
│   ├── Waveform
│   └── Spectrogram
└── Discrete
    ├── MIDI / Symbolic
    └── Codec Tokens

Waveform (Time Domain)

The lowest-level representation: a sequence of amplitude samples over time.

$$x = [x_0, x_1, \dots, x_{T-1}], \quad x_n \in [-1, 1]$$
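As a minimal sketch (pure Python, with an assumed 8 kHz sample rate), a sine tone is literally just such a list of amplitude samples:

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate=8000, amplitude=0.5):
    """Generate a sine tone as raw amplitude samples in [-1, 1]."""
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# 10 ms of A4 (440 Hz) at 8 kHz -> 80 samples
x = sine_wave(440.0, 0.01)
```

Even this tiny clip is 80 numbers for 10 ms; at CD rates, one second is 44,100 samples, which is why waveform models are so expensive.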

Properties

| Property | Value |
| --- | --- |
| Dimensionality | Very high (44,100 samples/sec for CD audio) |
| Information | Complete — no information loss |
| Interpretability | Low (hard to read musical content from samples) |
| Model compatibility | WaveNet, SampleRNN, WaveGlow |

Advantages

  • Lossless representation
  • No preprocessing artifacts
  • Captures all acoustic detail

Disadvantages

  • Extremely high dimensionality
  • Long-range dependencies are hard to model
  • No explicit frequency structure

Spectrogram (Time-Frequency Domain)

The Short-Time Fourier Transform (STFT) converts a waveform into a 2D time-frequency representation:

$$S(m, k) = \left|\sum_{n=0}^{N-1} x[n + mH] \, w[n] \, e^{-j 2\pi k n / N}\right|^2$$
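A direct, unoptimized rendering of this formula in pure Python (assuming a Hann window and keeping only the non-negative frequency bins); real implementations use an FFT instead of this O(N²) inner loop:

```python
import math

def stft_power(x, n_fft=8, hop=4):
    """Naive power spectrogram |STFT|^2 via a direct DFT (illustrative only)."""
    # Hann window: w[n] = 0.5 - 0.5*cos(2*pi*n/N)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        bins = []
        for k in range(n_fft // 2 + 1):  # non-negative frequencies only
            re = sum(x[start + n] * w[n] * math.cos(2 * math.pi * k * n / n_fft)
                     for n in range(n_fft))
            im = -sum(x[start + n] * w[n] * math.sin(2 * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            bins.append(re * re + im * im)
        frames.append(bins)
    return frames  # shape: [n_frames][n_fft // 2 + 1]

# A sinusoid sitting exactly on DFT bin k=2 of an 8-point frame
S = stft_power([math.sin(2 * math.pi * 2 * n / 8) for n in range(32)])
```

The energy concentrates in bin k = 2, with some leakage into neighboring bins from the Hann window.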

Key Parameters

| Parameter | Symbol | Typical Values |
| --- | --- | --- |
| FFT size | $N$ | 1024, 2048, 4096 |
| Hop size | $H$ | 256, 512, 1024 |
| Window | $w[n]$ | Hann, Hamming |

Trade-offs

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

The uncertainty principle: better time resolution means worse frequency resolution, and vice versa.

  • Larger $N$ → better frequency resolution, worse time resolution
  • Smaller $H$ → more time frames, more compute

Mel Spectrogram

A perceptually weighted spectrogram using the mel scale (see Mel Spectrograms):

$$M(m, b) = \sum_{k} W_{\text{mel}}(b, k) \cdot S(m, k)$$

| Property | Value |
| --- | --- |
| Dimensionality | Moderate (80–128 mel bands × time frames) |
| Perceptual alignment | Good — matches human frequency perception |
| Model compatibility | Tacotron, diffusion models, many classifiers |
| Invertibility | Approximate (requires a vocoder to recover the waveform) |

The mel spectrogram is the most common intermediate representation in audio ML.
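One common definition of the mel scale is the HTK formula (other variants exist); a sketch of how the triangular filterbank's edge frequencies are laid out, equally spaced in mel:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edge frequencies (Hz) for n_bands triangular mel filters:
    n_bands + 2 points, equally spaced on the mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    mels = [m_min + i * (m_max - m_min) / (n_bands + 1)
            for i in range(n_bands + 2)]
    return [700.0 * (10.0 ** (m / 2595.0) - 1.0) for m in mels]

edges = mel_band_edges(0.0, 8000.0, 80)  # 80 bands -> 82 edge points
```

Note how the bands get wider in Hz as frequency rises: the low end gets most of the resolution, mirroring human hearing.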

Constant-Q Transform (CQT)

Provides logarithmically spaced frequency bins (e.g., one bin per musical semitone with 12 bins per octave):

$$X_{\text{CQ}}(k) = \frac{1}{N_k}\sum_{n=0}^{N_k - 1} x[n] \, w_k[n] \, e^{-j 2\pi Q n / N_k}$$

where the window length $N_k$ varies per frequency bin to maintain a constant quality factor $Q = f_k / \Delta f_k$.

| Property | Value |
| --- | --- |
| Frequency spacing | Logarithmic (semitone-aligned) |
| Best for | Pitch tracking, chord recognition, music transcription |
| Drawback | Non-uniform time resolution across frequencies |
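The constant-Q constraint can be made concrete. With 12 bins per octave, $Q = 1/(2^{1/12} - 1) \approx 16.8$, and each bin's window length scales as $Q \cdot sr / f_k$ (the defaults below, C1 ≈ 32.70 Hz and sr = 22050, are illustrative assumptions):

```python
def cqt_geometry(f_min=32.70, bins_per_octave=12, n_bins=48, sr=22050):
    """Per-bin center frequency (Hz) and window length (samples) for a CQT."""
    # Q chosen so adjacent bins' bandwidths just touch: Q = f_k / delta_f_k
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)
    freqs = [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]
    win_lengths = [int(round(Q * sr / f)) for f in freqs]
    return freqs, win_lengths

freqs, wins = cqt_geometry()
```

Low bins need windows thousands of samples long while high bins need only a few hundred, which is exactly the non-uniform time resolution listed as a drawback above.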

Chromagram

A 12-dimensional representation that folds all octaves into a single pitch class distribution:

$$C(m, p) = \sum_{k \in \text{bin}(p)} S(m, k), \quad p \in \{\text{C}, \text{C}\#, \text{D}, \dots, \text{B}\}$$

| Property | Value |
| --- | --- |
| Dimensionality | 12 (one per pitch class) |
| Best for | Harmony analysis, chord detection, melody conditioning |
| Limitation | Loses octave information |

Used in MusicGen-Melody for melody-conditioned generation.
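The octave fold itself is a simple frequency-to-pitch-class mapping (sketched here with the standard A4 = 440 Hz reference):

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(f_hz):
    """Fold a frequency onto one of 12 pitch classes, discarding the octave."""
    # Fractional MIDI note number, then wrap modulo 12 (C = 0)
    midi = 69 + 12 * math.log2(f_hz / 440.0)
    return PITCH_CLASSES[int(round(midi)) % 12]
```

Both middle C (≈261.63 Hz) and the C an octave above (≈523.25 Hz) map to the same pitch class, which is what makes the chromagram robust for harmony but useless for register.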

MIDI and Symbolic Representations

MIDI encodes music as discrete events:

Note On: (pitch=60, velocity=80, time=0.0)
Note Off: (pitch=60, velocity=0, time=0.5)

Properties

| Property | Value |
| --- | --- |
| Dimensionality | Very low (events, not samples) |
| Information | Pitch, timing, velocity — no timbre or production |
| Interpretability | High — directly human-readable |
| Model compatibility | Music Transformer, MuseNet, Coconet |

Encoding Schemes for ML

  • MIDI-like tokens: Note, Time, Velocity as separate token types
  • REMI: "REvamped MIDI-derived events", which adds explicit bar and position tokens
  • Compound tokens: Bundle note attributes into single tokens
  • Piano roll: 2D binary matrix (pitch × time) — image-like
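A toy MIDI-like tokenizer shows the flavor of the first scheme (the token names and the 125 ms time grid here are illustrative choices, not any specific paper's vocabulary):

```python
def tokenize(notes, time_step=0.125):
    """Convert (pitch, start_s, duration_s, velocity) note tuples into a
    flat sequence of MIDI-like string tokens (toy scheme)."""
    events = []
    for pitch, start, dur, vel in notes:
        events.append((start, f"NOTE_ON_{pitch}", f"VELOCITY_{vel}"))
        events.append((start + dur, f"NOTE_OFF_{pitch}", None))
    events.sort(key=lambda e: e[0])

    tokens, clock = [], 0.0
    for t, main, velocity_tok in events:
        steps = int(round((t - clock) / time_step))  # quantized time advance
        if steps:
            tokens.append(f"TIME_SHIFT_{steps}")
            clock += steps * time_step
        if velocity_tok:
            tokens.append(velocity_tok)
        tokens.append(main)
    return tokens

# The Note On / Note Off pair from the example above
toks = tokenize([(60, 0.0, 0.5, 80)])
```

A language model then treats `toks` exactly like a sentence of words, which is why symbolic music generation inherited Transformer techniques so directly.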

Limitations

  • No timbre, production quality, or audio texture information
  • Cannot represent vocals, effects, or mixing
  • Requires MIDI data (not always available from audio)

Neural Codec Tokens

Discrete codes from trained neural audio codecs (see Neural Audio Codecs):

$$\mathbf{c}_t = (c_t^1, c_t^2, \dots, c_t^Q), \quad c_t^q \in \{1, \dots, K\}$$

| Property | Value |
| --- | --- |
| Dimensionality | Compact (50–75 tokens/sec × $Q$ codebooks) |
| Information | Complete audio reconstruction |
| Interpretability | Low (learned codes, not human-readable) |
| Model compatibility | MusicGen, AudioLM, SoundStorm |

Advantages

  • Discrete → compatible with language model techniques
  • Compact → efficient for long audio
  • Hierarchical → coarse-to-fine generation strategies

Disadvantages

  • Quantization introduces some quality loss
  • Codebook structure is opaque
  • Requires pre-trained codec
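The residual structure behind such codecs can be illustrated with a scalar toy quantizer (the codebook values are invented for illustration): each stage encodes only what the previous stages missed, so later codebooks refine rather than replace.

```python
def rvq_encode(x, codebooks):
    """Toy residual vector quantization on a scalar: each stage picks the
    nearest codeword for the current residual, then subtracts it."""
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]
    return codes, residual

# Coarse-to-fine codebooks: each stage covers a smaller range
codebooks = [[-0.5, 0.0, 0.5], [-0.1, 0.0, 0.1], [-0.02, 0.0, 0.02]]
codes, err = rvq_encode(0.43, codebooks)
```

The coarse-to-fine hierarchy is what enables the staged generation strategies mentioned under Advantages: a model can predict the first codebook everywhere, then fill in the refinements.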

Latent Representations (Continuous)

Continuous learned representations from autoencoders or VAEs:

$$\mathbf{z} = E_\phi(x) \in \mathbb{R}^{C \times T'}$$

| Property | Value |
| --- | --- |
| Dimensionality | Compact (channels × compressed time) |
| Information | High — encoder trained to preserve quality |
| Model compatibility | Latent diffusion (Stable Audio), VAE-based systems |

Used when continuous diffusion is preferred over discrete autoregressive generation.
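A back-of-envelope sketch of the compression: a strided-convolution encoder divides the time axis by the product of its strides (the stride values and channel count below are assumed for illustration, not taken from any specific model):

```python
def latent_shape(n_samples, strides=(2, 4, 4, 8), channels=64):
    """Output shape (C, T') of a strided-conv encoder:
    T' = n_samples // prod(strides)."""
    t = n_samples
    for s in strides:
        t //= s  # each conv layer downsamples by its stride
    return channels, t

shape = latent_shape(44100)  # 1 s of 44.1 kHz audio, 256x overall stride
```

One second of audio becomes a (64, 172) tensor: far cheaper for a diffusion model to denoise than 44,100 raw samples.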

Representation Comparison

| Representation | Dimensionality | Info Loss | Generation Method | Models |
| --- | --- | --- | --- | --- |
| Waveform | Very high | None | AR / flow | WaveNet, SampleRNN |
| Spectrogram | High | Phase | Diffusion | DiffWave, SpecDiff |
| Mel spectrogram | Moderate | Phase + frequency | Diffusion + vocoder | Tacotron + HiFi-GAN |
| CQT | Moderate | Phase | Classification | Pitch/chord models |
| MIDI | Very low | Timbre, production | AR Transformer | MuseNet, Music Transformer |
| Codec tokens | Low | Slight quality | AR Transformer | MusicGen, AudioLM |
| Latent | Low | Learned | Diffusion | Stable Audio |

Choosing a Representation

The right representation depends on your task:

| Goal | Best Representation |
| --- | --- |
| Highest fidelity | Waveform |
| Text-to-music generation | Codec tokens or latent |
| Music transcription | CQT or mel spectrogram |
| Chord/harmony analysis | Chromagram |
| Compositional control | MIDI / symbolic |
| Source separation | Spectrogram (complex) |
| Real-time generation | Codec tokens (compact) |