
Music Representations

The choice of how to represent music (as waveforms, spectrograms, symbolic tokens, or learned codes) is one of the most impactful engineering decisions in any AI music system. Each representation trades off fidelity, compactness, interpretability, and compatibility with different model architectures.

Taxonomy of Representations

             Music Representations
         ┌─────────────┴─────────────┐
     Continuous                   Discrete
   ┌─────┴─────┐              ┌──────┴──────┐
Waveform  Spectrogram   MIDI/Symbolic  Codec Tokens

Waveform (Time Domain)

The most raw representation: a sequence of amplitude samples over time.

$$x = [x_0, x_1, \dots, x_{T-1}], \quad x_n \in [-1, 1]$$
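As a concrete illustration of this definition, the sketch below synthesizes one second of a sine tone at the CD sample rate with NumPy. The tone frequency, amplitude, and variable names are illustrative choices, not part of the definition.

```python
import numpy as np

SAMPLE_RATE = 44_100   # CD-quality sample rate in Hz
DURATION_S = 1.0       # illustrative clip length
FREQ_HZ = 440.0        # illustrative tone (A4)

# Time axis: T = SAMPLE_RATE * DURATION_S sample instants.
t = np.arange(int(SAMPLE_RATE * DURATION_S)) / SAMPLE_RATE

# x[n] is an amplitude in [-1, 1]; a peak of 0.5 leaves headroom.
x = 0.5 * np.sin(2 * np.pi * FREQ_HZ * t)

print(x.shape)            # one amplitude value per sample
print(x.min(), x.max())   # stays inside [-1, 1]
```

Even this single second of mono audio is a vector of 44,100 values, which is why raw waveforms are expensive to model directly.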

Properties

| Property | Value |
| --- | --- |
| Dimensionality | Very high (44,100 samples/sec for CD) |
| Information | Complete; no information loss |
| Interpretability | Low (hard to read musical content from samples) |
| Model compatibility | WaveNet, SampleRNN, WaveGlow |

Advantages

  • Lossless representation
  • No preprocessing artifacts
  • Captures all acoustic detail

Disadvantages

  • Extremely high dimensionality
  • Long-range dependencies are hard to model
  • No explicit frequency structure

Spectrogram (Time-Frequency Domain)

The Short-Time Fourier Transform (STFT) converts a waveform into a 2D time-frequency representation. Taking its squared magnitude gives the power spectrogram, indexed by frame $m$ and frequency bin $k$:

$$S(m, k) = \left|\sum_{n=0}^{N-1} x[n + mH] \, w[n] \, e^{-j2\pi kn/N}\right|^2$$

Key Parameters

| Parameter | Symbol | Typical Values |
| --- | --- | --- |
| FFT size | $N$ | 1024, 2048, 4096 |
| Hop size | $H$ | 256, 512, 1024 |
| Window | $w[n]$ | Hann, Hamming |
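The formula and parameters above can be sketched directly in NumPy. This is a minimal version that assumes no padding (only full frames are kept) and a symmetric Hann window; the function name `power_spectrogram` and the 1 kHz test tone are illustrative.

```python
import numpy as np

def power_spectrogram(x, n_fft=1024, hop=256):
    """S[m, k] = |sum_n x[n + m*hop] * w[n] * exp(-j*2*pi*k*n/n_fft)|^2."""
    window = np.hanning(n_fft)                 # symmetric Hann window w[n]
    n_frames = 1 + (len(x) - n_fft) // hop     # full frames only, no padding
    frames = np.stack([x[m * hop : m * hop + n_fft] for m in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)   # bins k = 0 .. n_fft/2
    return np.abs(spectrum) ** 2

sr = 44_100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000.0 * t)             # 1 kHz test tone

S = power_spectrogram(x)
peak_bin = int(S.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 1024                 # bin index -> frequency in Hz
print(S.shape, peak_hz)
```

Because the input is real, `np.fft.rfft` keeps only the non-negative frequencies, so each frame has `n_fft // 2 + 1` bins. The strongest bin lands within one bin width of the 1 kHz tone.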

Trade-offs

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

This is the time-frequency uncertainty principle, where $\Delta t$ and $\Delta f$ are the spreads (standard deviations) of the analysis window in time and in frequency: better time resolution means worse frequency resolution, and vice versa.

  • Larger $N$ → better frequency resolution, worse time resolution
  • Smaller $H$ → more time frames, more compute
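To put numbers on this trade-off, the following sketch computes the FFT bin spacing and the frame duration for the typical FFT sizes listed earlier, assuming a 44.1 kHz sample rate. The helper name `stft_resolution` is hypothetical.

```python
def stft_resolution(n_fft, sample_rate=44_100):
    """Return (bin spacing in Hz, window duration in seconds) for one FFT size."""
    return sample_rate / n_fft, n_fft / sample_rate

for n_fft in (1024, 2048, 4096):
    delta_f, window_s = stft_resolution(n_fft)
    print(f"N={n_fft}: {delta_f:.1f} Hz per bin, {1000 * window_s:.1f} ms per frame")
```

Doubling the FFT size halves the bin spacing and doubles the frame duration, so the product of the two stays fixed at 1 regardless of the FFT size.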

Mel Spectrogram

A perceptually weighted spectrogram using the mel scale (see Mel Spectrograms):

$$M(m, b) = \sum_{k} W_{\text{mel}}(b, k) \cdot S(m, k)$$

| Property | Value |
| --- | --- |
| Dimensionality | Moderate (80–128 mel bands × time frames) |
| Perceptual alignment | Good; matches human frequency perception |
| Model compatibility | Tacotron, Diffusion, many classifiers |
| Invertibility | Approximate (requires vocoder for waveform) |
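One way to build the weight matrix $W_{\text{mel}}$ is with triangular filters whose edges are equally spaced on the mel scale. The sketch below assumes the HTK-style mel formula (one of several conventions in use) and unnormalized triangles; the function names and the random stand-in spectrogram are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumed convention; others exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def mel_filterbank(sr=44_100, n_fft=2048, n_mels=80):
    """Triangular filters W_mel[b, k] over the n_fft // 2 + 1 STFT bins."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    W = np.zeros((n_mels, len(fft_freqs)))
    for b in range(n_mels):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (fft_freqs - lo) / (mid - lo)     # ramp up from lo to mid
        falling = (hi - fft_freqs) / (hi - mid)    # ramp down from mid to hi
        W[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return W

W = mel_filterbank()
S = np.random.default_rng(0).random((10, 1025))    # stand-in power spectrogram
M = S @ W.T                                        # M[m, b] = sum_k W[b, k] * S[m, k]
print(W.shape, M.shape)
```

The matrix product collapses 1,025 linear frequency bins into 80 mel bands per frame, which is where the reduced dimensionality comes from.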

The mel spectrogram is the most common intermediate representation in audio ML.

Constant-Q Transform (CQT)

Provides logarithmically spaced frequency bins, typically one bin per musical semitone (12 per octave):

$$X_{\text{CQ}}(k) = \frac{1}{N_k}\sum_{n=0}^{N_k - 1} x[n] \, w_k[n] \, e^{-j2\pi Q n / N_k}$$

where the window length $N_k$ varies per frequency bin to maintain a constant $Q = f_k / \Delta f_k$.

| Property | Value |
| --- | --- |
| Frequency spacing | Logarithmic (semitone-aligned) |
| Best for | Pitch tracking, chord recognition, music transcription |
| Drawback | Non-uniform time resolution across frequencies |
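The relationship between $Q$, the center frequencies, and the per-bin window lengths can be sketched as follows. This is a naive, single-frame evaluation of the formula above, not an efficient CQT. The starting frequency (C1, about 32.70 Hz), the 84-bin range, the Hann window, and the function names are all illustrative assumptions.

```python
import numpy as np

def cqt_bins(f_min=32.70, n_bins=84, bins_per_octave=12, sr=44_100):
    """Center frequencies f_k, quality factor Q, and window lengths N_k."""
    k = np.arange(n_bins)
    f_k = f_min * 2.0 ** (k / bins_per_octave)        # geometric spacing
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # Q = f_k / delta_f_k
    N_k = np.ceil(Q * sr / f_k).astype(int)           # longer windows for low bins
    return f_k, Q, N_k

def cqt_frame(x, Q, N_k):
    """Naive X_CQ(k), evaluated on the first N_k samples of x for each bin."""
    out = np.empty(len(N_k), dtype=complex)
    for i, n_len in enumerate(N_k):
        n = np.arange(n_len)
        w = np.hanning(n_len)
        out[i] = np.sum(x[:n_len] * w * np.exp(-2j * np.pi * Q * n / n_len)) / n_len
    return out

sr = 44_100
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)    # A4 test tone

f_k, Q, N_k = cqt_bins(sr=sr)
X = cqt_frame(x, Q, N_k)
peak = int(np.argmax(np.abs(X)))
print(round(Q, 2), N_k[0], N_k[-1], peak, round(f_k[peak], 1))
```

The lowest bin needs a window of over twenty thousand samples while the highest needs only a few hundred, which is the non-uniform time resolution listed as a drawback in the table.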

Chromagram

A 12-dimensional representation that folds all octaves into a single pitch class distribution:

$$C(m, p) = \sum_{k \in \text{bin}(p)} S(m, k), \quad p \in \{C, C\#, D, \dots, B\}$$

| Property | Value |
| --- | --- |
| Dimensionality | 12 (one per pitch class) |
| Best for | Harmony analysis, chord detection, melody conditioning |
| Limitation | Loses octave information |

Used in MusicGen-Melody for melody-conditioned generation.
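A simple way to realize the folding in the chromagram formula is to assign each STFT bin to its nearest semitone and sum the energy per pitch class. The sketch below assumes this hard assignment and A4 = 440 Hz tuning; practical implementations often use smoother weighting. The function name and the synthetic input are illustrative.

```python
import numpy as np

def chroma_from_power(S, sr, n_fft, f_ref=440.0):
    """Fold a power spectrogram S[m, k] into 12 pitch classes (0 = C, ..., 11 = B)."""
    freqs = np.arange(S.shape[1]) * sr / n_fft
    valid = freqs > 0                                  # skip DC (log2 undefined)
    midi = 69 + 12 * np.log2(freqs[valid] / f_ref)     # fractional MIDI pitch per bin
    pitch_class = np.round(midi).astype(int) % 12      # nearest semitone, octave removed
    C = np.zeros((S.shape[0], 12))
    for p in range(12):
        C[:, p] = S[:, valid][:, pitch_class == p].sum(axis=1)
    return C

# Synthetic input: all energy in the STFT bin closest to 440 Hz.
sr, n_fft = 44_100, 4096
S = np.zeros((3, n_fft // 2 + 1))
S[:, round(440.0 * n_fft / sr)] = 1.0

C = chroma_from_power(S, sr, n_fft)
print(C.argmax(axis=1))   # index of the strongest pitch class in each frame
```

With energy only near 440 Hz, every frame peaks at pitch class 9 (A). The same result would appear for 220 Hz or 880 Hz, which is the loss of octave information noted above.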

MIDI and Symbolic Representations

MIDI encodes music as discrete events:

Note On: (pitch=60, velocity=80, time=0.0)
Note Off: (pitch=60, velocity=0, time=0.5)
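The event pair above can be turned into a note with an explicit duration. The sketch below uses a hypothetical tuple layout `(kind, pitch, velocity, time_seconds)` mirroring the example; real MIDI files store tick-based delta times, which are not modeled here. The pitch-to-frequency helper assumes standard A4 = 440 Hz tuning.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int       # MIDI note number, 60 = middle C
    velocity: int    # loudness of the note-on, 1..127
    start: float     # onset time in seconds
    end: float       # release time in seconds

def pair_note_events(events):
    """Turn (kind, pitch, velocity, time) tuples into Note objects."""
    active = {}      # pitch -> (velocity, start time) of the sounding note
    notes = []
    for kind, pitch, velocity, time in sorted(events, key=lambda e: e[3]):
        if kind == "on" and velocity > 0:
            active[pitch] = (velocity, time)
        elif pitch in active:            # "off", or "on" with velocity 0
            vel, start = active.pop(pitch)
            notes.append(Note(pitch, vel, start, time))
    return notes

def midi_to_hz(pitch):
    """Equal-tempered frequency of a MIDI pitch, assuming A4 (69) = 440 Hz."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12)

events = [("on", 60, 80, 0.0), ("off", 60, 0, 0.5)]
notes = pair_note_events(events)
print(notes, round(midi_to_hz(60), 2))
```

Two events are enough to describe half a second of middle C, compared with 22,050 samples for the same span as a CD-rate waveform.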

Properties

| Property | Value |
| --- | --- |
| Dimensionality | Very low (events, not samples) |
| Information | Pitch, timing, velocity; no timbre or production |
| Interpretability | High; directly human-readable |
| Model compatibility | Music Transformer, MuseNet, Coconet |

Encoding Schemes for ML

  • MIDI-like tokens: Note, Time, Velocity as separate token types
  • REMI: REvamped MIDI-derived events, which add bar and position tokens to place notes on an explicit metrical grid
  • Compound tokens: Bundle note attributes into single tokens
  • Piano roll: 2D binary matrix (pitch × time), image-like
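Two of the schemes listed above, the piano roll and MIDI-like tokens, can be sketched from the same note list. The token names (`NOTE_ON_*`, `NOTE_OFF_*`, `VELOCITY_*`, `TIME_SHIFT_*`), the 10 ms time step, and the 100 frames/sec roll are illustrative assumptions; published vocabularies typically also quantize velocity and cap the time-shift size.

```python
import numpy as np

# (pitch, velocity, start_s, end_s): a hypothetical two-note melody
notes = [(60, 80, 0.0, 0.5), (64, 72, 0.5, 1.0)]

def to_piano_roll(notes, fps=100, n_pitches=128):
    """Binary matrix with roll[pitch, frame] = True while the note sounds."""
    n_frames = int(round(max(n[3] for n in notes) * fps))
    roll = np.zeros((n_pitches, n_frames), dtype=bool)
    for pitch, _vel, start, end in notes:
        roll[pitch, int(round(start * fps)) : int(round(end * fps))] = True
    return roll

def to_midi_like_tokens(notes, step_s=0.01):
    """Flatten notes into NOTE_ON / NOTE_OFF / VELOCITY / TIME_SHIFT tokens."""
    events = []
    for pitch, vel, start, end in notes:
        events.append((start, 1, [f"VELOCITY_{vel}", f"NOTE_ON_{pitch}"]))
        events.append((end, 0, [f"NOTE_OFF_{pitch}"]))   # offs sort before ons
    tokens, now = [], 0.0
    for time, _order, toks in sorted(events, key=lambda e: e[:2]):
        steps = int(round((time - now) / step_s))
        if steps > 0:
            tokens.append(f"TIME_SHIFT_{steps}")
            now = time
        tokens.extend(toks)
    return tokens

roll = to_piano_roll(notes)
tokens = to_midi_like_tokens(notes)
print(roll.shape)
print(tokens)
```

The piano roll grows with clip duration even when nothing changes, while the token sequence grows only with the number of note events.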

Limitations

  • No timbre, production quality, or audio texture information
  • Cannot represent vocals, effects, or mixing
  • Requires MIDI data (not always available from audio)

Neural Codec Tokens

Discrete codes from trained neural audio codecs (see Neural Audio Codecs):

$$\mathbf{c}_t = (c_t^1, c_t^2, \dots, c_t^Q), \quad c_t^q \in \{1, \dots, K\}$$

where $Q$ is the number of codebooks and $K$ is the number of entries in each codebook.

| Property | Value |
| --- | --- |
| Dimensionality | Compact (50–75 frames/sec × $Q$ codebooks) |
| Information | Near-complete; reconstruction is lossy but perceptually close |
| Interpretability | Low (learned codes, not human-readable) |
| Model compatibility | MusicGen, AudioLM, SoundStorm |

Advantages

  • Discrete → compatible with language model techniques
  • Compact → efficient for long audio
  • Hierarchical → coarse-to-fine generation strategies

Disadvantages

  • Quantization introduces some quality loss
  • Codebook structure is opaque
  • Requires pre-trained codec
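The hierarchical, coarse-to-fine structure can be illustrated with residual vector quantization, the scheme used by codecs such as SoundStream and EnCodec. The sketch below is a toy under stated assumptions: the codebooks are random rather than trained, the sizes `DIM`, `K`, and `Q` are illustrative, and indices are 0-based rather than the 1-based range used in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K, Q = 8, 16, 4    # latent size, codebook size, number of codebooks (illustrative)

# Random codebooks stand in for trained ones; later stages use smaller scales.
codebooks = [rng.normal(scale=0.5 ** q, size=(K, DIM)) for q in range(Q)]

def rvq_encode(z, codebooks):
    """Quantize one frame z; returns Q codebook indices, coarse to fine."""
    residual, codes = z.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]      # the next stage encodes what is left
    return codes

def rvq_decode(codes, codebooks, n_stages=None):
    """Sum the selected codewords of the first n_stages codebooks."""
    n_stages = len(codes) if n_stages is None else n_stages
    return sum(cb[i] for cb, i in zip(codebooks[:n_stages], codes[:n_stages]))

z = rng.normal(size=DIM)                   # stand-in for one encoder output frame
codes = rvq_encode(z, codebooks)
errors = [float(np.linalg.norm(z - rvq_decode(codes, codebooks, q)))
          for q in range(1, Q + 1)]
print(codes)
print(errors)
```

Each frame becomes `Q` small integers, which is what lets language-model techniques operate on audio. With trained codebooks, each added stage would typically reduce the reconstruction error; with the random codebooks used here that is not guaranteed.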

Latent Representations (Continuous)

Continuous learned representations from autoencoders or VAEs:

$$\mathbf{z} = E_\phi(x) \in \mathbb{R}^{C \times T'}$$

| Property | Value |
| --- | --- |
| Dimensionality | Compact (channel × compressed time) |
| Information | High; encoder trained to preserve quality |
| Model compatibility | Latent diffusion (Stable Audio), VAE-based systems |

Used when continuous diffusion is preferred over discrete autoregressive generation.
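To make the shape $C \times T'$ concrete, the toy sketch below maps non-overlapping blocks of samples to latent vectors with a single random linear projection. This only illustrates the shape bookkeeping: a real encoder $E_\phi$ is a trained neural network, and the downsampling factor and channel count here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STRIDE, CHANNELS = 1024, 64    # hypothetical downsampling factor and latent channels

# One random linear map per block stands in for a trained encoder E_phi.
weights = rng.normal(size=(CHANNELS, STRIDE)) / np.sqrt(STRIDE)

def toy_encode(x):
    """Map a waveform of length T to a latent of shape (C, T') with T' = T // STRIDE."""
    n_frames = len(x) // STRIDE
    frames = x[: n_frames * STRIDE].reshape(n_frames, STRIDE)
    return weights @ frames.T

x = rng.uniform(-1.0, 1.0, size=44_100)   # one second of stand-in audio
z = toy_encode(x)
print(z.shape)                            # (C, T')
```

Unlike codec tokens, the entries of `z` are real-valued, which is what makes the representation a natural fit for diffusion models.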

Representation Comparison

| Representation | Dimensionality | Information Lost | Generation Method | Models |
| --- | --- | --- | --- | --- |
| Waveform | Very high | None | AR / Flow | WaveNet, SampleRNN |
| Spectrogram | High | Phase | Diffusion | DiffWave, SpecDiff |
| Mel spectrogram | Moderate | Phase + frequency resolution | Diffusion + Vocoder | Tacotron + HiFi-GAN |
| CQT | Moderate | Phase | Classification | Pitch/chord models |
| MIDI | Very low | Timbre, production | AR Transformer | MuseNet, Music Transformer |
| Codec tokens | Low | Slight quality | AR Transformer | MusicGen, AudioLM |
| Latent | Low | Learned | Diffusion | Stable Audio |
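The dimensionality column can be made concrete with a rough count of values per second of audio. The sketch below reuses figures quoted earlier (44,100 samples/sec, 80 mel bands, hop size 512, 50 codec frames/sec); the choice of four codebooks is an illustrative assumption.

```python
SAMPLE_RATE = 44_100
HOP = 512          # STFT hop size from the typical values listed earlier
N_MELS = 80        # lower end of the 80–128 mel band range
CODEC_FPS = 50     # lower end of the 50–75 frames/sec range
N_CODEBOOKS = 4    # illustrative value for Q

values_per_second = {
    "waveform": SAMPLE_RATE,
    "mel spectrogram": N_MELS * SAMPLE_RATE / HOP,
    "codec tokens": CODEC_FPS * N_CODEBOOKS,
}

for name, count in values_per_second.items():
    print(f"{name}: {count:,.0f} values/sec")
```

Under these assumptions the mel spectrogram is roughly six times smaller than the waveform, and codec tokens are over two hundred times smaller.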

Choosing a Representation

The right representation depends on your task:

| Goal | Best Representation |
| --- | --- |
| Highest fidelity | Waveform |
| Text-to-music generation | Codec tokens or Latent |
| Music transcription | CQT or Mel spectrogram |
| Chord/harmony analysis | Chromagram |
| Compositional control | MIDI / symbolic |
| Source separation | Spectrogram (complex) |
| Real-time generation | Codec tokens (compact) |