Psychoacoustics

Psychoacoustics studies how humans perceive sound. Understanding these principles is essential for AI music engineering because perceptual relevance — not raw signal accuracy — determines whether generated audio sounds good.

Frequency Perception and the Mel Scale

Human frequency perception is approximately logarithmic: the perceptual distance between 100 Hz and 200 Hz feels similar to the distance between 1000 Hz and 2000 Hz.

The mel scale formalizes this:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

This mapping is fundamental to mel spectrograms, mel-frequency cepstral coefficients (MFCCs), and the filter bank design used in virtually every audio ML system.
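As a minimal sketch, the mel formula and its algebraic inverse can be written directly in NumPy. The function names here are illustrative and do not come from any particular library; audio libraries often ship their own variants with slightly different constants.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_spaced_frequencies(f_min, f_max, n_points):
    """Return n_points frequencies (Hz) that are equally spaced in mels."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_points)
    return mel_to_hz(mels)

# Ten points between 0 Hz and 8 kHz: dense at low frequencies, sparse at high.
centers = mel_spaced_frequencies(0.0, 8000.0, 10)
```

With this formula 1000 Hz maps to roughly 1000 mels, and the spacing between successive entries of `centers` grows with frequency, which is the behavior a mel filter bank relies on.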

Loudness Perception

Perceived loudness does not scale linearly with amplitude. The relationship follows approximately:

$$L \propto I^{0.3}$$

where $I$ is sound intensity (equivalently $L \propto p^{0.6}$ for sound pressure $p$, since $I \propto p^2$). A 10 dB increase is perceived as roughly "twice as loud."
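The "twice as loud per 10 dB" rule of thumb can be sketched as a pair of small helpers. The sone conversion below uses the conventional definition (1 sone at 40 phon, doubling every 10 phon) and is only a reasonable approximation above about 40 phon; the function names are illustrative.

```python
def loudness_ratio(delta_db):
    """Approximate perceived loudness ratio for a level change in dB.

    Rule of thumb: every +10 dB roughly doubles perceived loudness.
    """
    return 2.0 ** (delta_db / 10.0)

def phon_to_sone(phon):
    """Convert loudness level (phon) to loudness (sone).

    Assumes the conventional mapping: 40 phon = 1 sone, +10 phon = x2.
    """
    return 2.0 ** ((phon - 40.0) / 10.0)

# A 20 dB boost is perceived as roughly four times as loud, not 100 times.
boost = loudness_ratio(20.0)
```

The practical consequence is that level differences are usually best handled in dB, because equal dB steps correspond to roughly equal loudness ratios.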

Equal-Loudness Contours (Fletcher–Munson Curves)

Human sensitivity varies with frequency. We are most sensitive around 2–5 kHz (where speech consonants live) and much less sensitive at very low and very high frequencies.

| Frequency Range | Sensitivity | Musical Relevance |
|---|---|---|
| 20–80 Hz | Low | Sub-bass, felt more than heard |
| 80–300 Hz | Moderate | Bass instruments, warmth |
| 300–2000 Hz | High | Vocals, melodic instruments |
| 2–5 kHz | Highest | Presence, articulation, clarity |
| 5–10 kHz | Moderate | Brilliance, air |
| 10–20 kHz | Declining | Sparkle, high harmonics |

AI models that optimize purely on waveform MSE may allocate equal importance to all frequencies. Perceptually weighted loss functions improve subjective quality by emphasizing bands where human hearing is most sensitive.
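One simple way to sketch a perceptually weighted loss is to scale spectral errors by the A-weighting curve, a standard rough approximation of an equal-loudness contour. This is an assumption chosen for illustration, not a method the text prescribes; the function names are hypothetical, and the loss operates on a single magnitude spectrum rather than a full STFT.

```python
import numpy as np

def a_weighting_db(f_hz):
    """Analog A-weighting gain in dB (approximately 0 dB at 1 kHz)."""
    f2 = np.asarray(f_hz, dtype=float) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    # At 0 Hz the gain is zero, so the dB value is -inf; silence the warning.
    with np.errstate(divide="ignore"):
        return 20.0 * np.log10(num / den) + 2.0

def weighted_spectral_mse(target, estimate, sample_rate):
    """MSE between magnitude spectra, weighted by linear A-weighting gain."""
    freqs = np.fft.rfftfreq(len(target), d=1.0 / sample_rate)
    weights = 10.0 ** (a_weighting_db(freqs) / 20.0)
    err = np.abs(np.fft.rfft(target)) - np.abs(np.fft.rfft(estimate))
    return float(np.mean((weights * err) ** 2))
```

Under this weighting, an error of a given amplitude at 3 kHz contributes far more to the loss than the same error at 50 Hz, whereas plain waveform MSE would treat them equally.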

Auditory Masking

Masking occurs when one sound makes another sound inaudible or less perceptible.

Simultaneous (Frequency) Masking

A loud tone at frequency $f_m$ raises the hearing threshold for nearby frequencies. The masking effect is asymmetric: it extends further toward higher frequencies.

This property is exploited by:

  • Lossy audio codecs (MP3, AAC, Opus) to discard inaudible components
  • Perceptual loss functions that weight errors by masking thresholds
  • Audio quality metrics that model masked distortion
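The asymmetry of simultaneous masking can be illustrated with Schroeder's spreading function, one commonly used analytic approximation. Choosing this particular model is an assumption made for illustration; real codecs and metrics use their own, more elaborate masking models. Here `dz` is the maskee's position minus the masker's position, measured in Bark.

```python
import numpy as np

def schroeder_spreading_db(dz):
    """Relative masking level in dB at a distance of dz Bark from the masker.

    Positive dz means the masked sound lies above the masker in frequency.
    The curve peaks near 0 dB at dz = 0 and falls off more slowly for dz > 0.
    """
    x = np.asarray(dz, dtype=float) + 0.474
    return 15.81 + 7.5 * x - 17.5 * np.sqrt(1.0 + x ** 2)

# Compare masking three Bark below and three Bark above the masker.
below = schroeder_spreading_db(-3.0)
above = schroeder_spreading_db(3.0)
```

In this model the masking level three Bark above the masker is considerably higher than three Bark below it, matching the upward spread of masking described above.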

Temporal Masking

A loud sound masks quieter sounds that occur shortly before (pre-masking, ~5 ms) or after (post-masking, ~50–200 ms) it.

Temporal masking affects how listeners perceive transient accuracy in generated audio. Small timing errors in attacks may be inaudible if a louder event occurs nearby.

Critical Bands and the Bark Scale

The cochlea analyzes sound in overlapping frequency bands called critical bands. There are approximately 24 critical bands spanning the audible range.

The Bark scale maps frequency to critical band rate:

$$z = 13 \arctan(0.00076 f) + 3.5 \arctan\!\left[\left(\frac{f}{7500}\right)^2\right]$$

Two tones within the same critical band interact (masking, roughness, beating). Tones in different critical bands are perceived more independently.
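A rough sketch of a Bark conversion, using the Zwicker and Terhardt approximation as it is usually published (with the square inside the second arctangent). Treating "less than 1 Bark apart" as "same critical band" is a simplifying assumption for illustration, and the helper names are hypothetical.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker and Terhardt approximation)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def same_critical_band(f1_hz, f2_hz, width_bark=1.0):
    """Heuristic: two tones interact strongly if under width_bark apart."""
    return bool(abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz)) < width_bark)

# 1000 Hz and 1050 Hz are close enough to interact; 1000 Hz and 2000 Hz are not.
close_pair = same_critical_band(1000.0, 1050.0)
far_pair = same_critical_band(1000.0, 2000.0)
```

A check like this can be used to decide whether two spectral components should be analyzed jointly (for masking or roughness) or treated as perceptually separate.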

Pitch Perception

Pitch is a perceptual attribute related primarily to fundamental frequency ($f_0$), but also influenced by harmonics and spectral envelope.

Harmonic Series

Musical sounds typically consist of a fundamental plus overtones:

$$f_n = n \cdot f_0, \quad n = 1, 2, 3, \dots$$

The distribution and relative strength of harmonics determine timbre, which is why a piano and a guitar playing the same note sound different.
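To make the link between harmonic amplitudes and timbre concrete, the sketch below synthesizes two tones with the same fundamental but different harmonic weights and compares their spectral centroids. The spectral centroid is used here as a crude proxy for brightness; the amplitude lists and function names are made up for illustration.

```python
import numpy as np

def harmonic_tone(f0, amplitudes, sample_rate=16000, duration=0.5):
    """Sum of sinusoids at n * f0 with the given per-harmonic amplitudes."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    tone = np.zeros_like(t)
    for n, amp in enumerate(amplitudes, start=1):
        tone += amp * np.sin(2 * np.pi * n * f0 * t)
    return tone

def spectral_centroid(signal, sample_rate=16000):
    """Magnitude-weighted mean frequency of the signal's spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * mag) / np.sum(mag))

# Same pitch (220 Hz), different harmonic balance.
mellow = harmonic_tone(220.0, [1.0, 0.3, 0.1, 0.03])
bright = harmonic_tone(220.0, [1.0, 0.9, 0.8, 0.7])
```

Both tones share the same $f_0$, yet `bright` has a higher spectral centroid than `mellow` because more of its energy sits in the upper harmonics.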

Missing Fundamental

Humans can perceive pitch even when the fundamental frequency is absent, inferring it from the spacing of upper harmonics. AI systems that equate pitch with the lowest or strongest spectral peak may struggle with this phenomenon, because the spectrum contains no energy at the fundamental.
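The missing-fundamental effect can be reproduced numerically. The sketch below builds a tone from harmonics 2 through 5 of 200 Hz, so the spectrum has essentially no energy at 200 Hz, and then estimates pitch from the autocorrelation peak. Autocorrelation is one simple periodicity-based estimator chosen for illustration; the function name and parameters are hypothetical.

```python
import numpy as np

def estimate_pitch_autocorr(signal, sample_rate, f_min=50.0, f_max=1000.0):
    """Estimate pitch (Hz) from the strongest autocorrelation peak."""
    signal = signal - np.mean(signal)
    # Linear autocorrelation via zero-padded FFT (Wiener-Khinchin).
    spectrum = np.fft.rfft(signal, n=2 * len(signal))
    acf = np.fft.irfft(np.abs(spectrum) ** 2)[: len(signal)]
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = lag_min + int(np.argmax(acf[lag_min : lag_max + 1]))
    return sample_rate / best_lag

sample_rate = 16000
t = np.arange(4000) / sample_rate
# Harmonics 2..5 of 200 Hz (400, 600, 800, 1000 Hz); the fundamental is left out.
tone = sum(np.sin(2 * np.pi * k * 200.0 * t) for k in range(2, 6))
magnitude = np.abs(np.fft.rfft(tone))  # bin spacing is 4 Hz, so bin 50 is 200 Hz
pitch = estimate_pitch_autocorr(tone, sample_rate)
```

The estimate is expected to land near 200 Hz even though `magnitude[50]` is negligible, because the waveform still repeats every 5 ms. A picker that returns the lowest spectral peak would report 400 Hz instead.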

Spatial Hearing

Humans localize sound using:

  • Interaural Time Difference (ITD): arrival time difference between ears (effective below ~1500 Hz)
  • Interaural Level Difference (ILD): amplitude difference (effective above ~1500 Hz)
  • Head-Related Transfer Function (HRTF): spectral coloring from head/ear geometry

Stereo and spatial audio generation in AI systems can benefit from modeling these cues for more natural-sounding output.
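As a rough illustration of the ITD cue, the classic Woodworth spherical-head approximation relates source azimuth to arrival-time difference. The head radius of 8.75 cm and the speed of sound of 343 m/s are assumed typical values, and the formula is intended for azimuths between 0 and 90 degrees; real HRTFs differ between listeners.

```python
import numpy as np

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Interaural time difference in seconds for a rigid spherical head.

    Woodworth approximation: ITD = (a / c) * (theta + sin(theta)),
    with theta the azimuth in radians (0 = straight ahead, 90 = to the side).
    """
    theta = np.radians(azimuth_deg)
    return head_radius_m / speed_of_sound * (theta + np.sin(theta))

def itd_in_samples(azimuth_deg, sample_rate=48000):
    """Same ITD expressed as a (fractional) delay in samples."""
    return itd_woodworth(azimuth_deg) * sample_rate

# ITD for a source directly to one side of the listener.
side_itd = itd_woodworth(90.0)
```

With these assumed values the maximum ITD comes out at roughly 0.66 ms, which is only a few dozen samples at 48 kHz. That is why stereo generation needs sample-accurate inter-channel timing for this cue to be preserved.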

Implications for AI Audio Engineering

| Perceptual Principle | Engineering Application |
|---|---|
| Mel/Bark scale | Use perceptually spaced frequency representations |
| Masking | Weight loss functions by audibility |
| Loudness curves | Apply A-weighting or LUFS normalization |
| Critical bands | Design filter banks aligned to auditory resolution |
| Timbre via harmonics | Train on representations that capture harmonic structure |
| Spatial cues | Generate coherent stereo/binaural output |

Models trained with perceptual awareness (through mel representations, multi-scale discriminators, or perceptual loss functions) tend to achieve better subjective quality than those trained on raw waveform objectives alone.
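As one hedged example of a perceptually motivated objective, the sketch below builds a triangular mel filter bank and compares two signals by the mean squared difference of their log-mel energies. It works on a single frame rather than a full STFT, uses unnormalized filters, and all function names are illustrative; production systems typically rely on a library implementation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with edges equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    )
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        # Triangle = min of the two slopes, clipped at zero outside [lo, hi].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_distance(x, y, sample_rate, n_mels=40, eps=1e-10):
    """Mean squared difference between log-mel energies of two frames."""
    bank = mel_filterbank(n_mels, len(x), sample_rate)
    mel_x = bank @ (np.abs(np.fft.rfft(x)) ** 2)
    mel_y = bank @ (np.abs(np.fft.rfft(y)) ** 2)
    return float(np.mean((np.log(mel_x + eps) - np.log(mel_y + eps)) ** 2))
```

Compared with waveform MSE, this distance ignores phase, compresses level differences logarithmically, and pools high frequencies into wide bands, which reflects several of the principles in the table above.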