Audio Embeddings
Audio embeddings are dense vectors that encode perceptual and structural properties of sound so downstream models can reason over them efficiently.
From Waveform to Embedding
Given a waveform segment $x \in \mathbb{R}^T$, an encoder $f_\theta$ maps it to a latent vector:

$$z = f_\theta(x), \quad z \in \mathbb{R}^d$$
In production systems, the input is often a log-mel spectrogram or neural codec representation rather than raw waveform samples.
Typical Front-End Steps
- Resample and normalize to a fixed sample rate and loudness range
- Frame with STFT to obtain time-frequency bins
- Project to mel bands to reduce dimensionality and align with auditory resolution
- Apply log compression to stabilize dynamic range for training
Mel conversion from frequency $f$ (Hz):

$$m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
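The front-end steps above, including the mel conversion, can be sketched in plain NumPy. The parameter values (16 kHz sample rate, 512-point FFT, 160-sample hop, 40 mel bands) and the peak-normalization step are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    # 1) Normalize (simple peak normalization as a stand-in for loudness norm)
    wave = wave / (np.max(np.abs(wave)) + 1e-8)
    # 2) Frame + STFT: Hann-windowed power spectrogram
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # (frames, n_fft//2+1)
    # 3) Mel filterbank: triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel_spec = power @ fbank.T
    # 4) Log compression to stabilize dynamic range
    return np.log(mel_spec + 1e-6)
```

For a one-second 16 kHz input this yields a `(97, 40)` log-mel matrix, one row per 10 ms frame; production systems typically use a tuned library front end (e.g. a mel spectrogram routine) rather than hand-rolled filters.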
Similarity and Retrieval
Embedding spaces support nearest-neighbor lookup and contrastive training.
High cosine similarity usually indicates related instrumentation, texture, or rhythmic profile.
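A minimal nearest-neighbor lookup over an embedding bank can be written with cosine similarity; the function name and brute-force search are a sketch (real systems use approximate indexes for large banks):

```python
import numpy as np

def cosine_topk(query, bank, k=3):
    # L2-normalize so dot products equal cosine similarity
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity to every bank entry
    idx = np.argsort(-sims)[:k]       # indices of the k most similar entries
    return idx, sims[idx]
```

Querying with an embedding that is already in the bank returns that entry first with similarity 1.0, which is a useful sanity check.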
Contrastive Objective (InfoNCE)
For a positive pair $(z_i, z_i^+)$ in a batch of $N$ examples:

$$\mathcal{L}_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_i^+)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(z_i, z_j^+)/\tau\big)}$$

where $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature.
This objective pulls matched audio/text or audio/audio pairs together and pushes mismatched examples apart, improving controllability in generative pipelines.
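The InfoNCE objective can be sketched in NumPy as a cross-entropy over a similarity matrix whose diagonal holds the matched pairs. The temperature value and function name are illustrative assumptions:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature            # (B, B); diagonal = matched pairs
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matched pair) as the target class
    return -np.mean(np.diag(log_probs))
```

The loss is low when each anchor is most similar to its own positive and high when the matching is scrambled, which is exactly the pull-together/push-apart behavior described above.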