Glossary of AI Music Terms
A comprehensive reference for terminology used throughout this handbook and in the broader AI music field.
A-weighting
A frequency weighting curve that approximates human hearing sensitivity. Used in loudness measurements to emphasize frequencies where the ear is most sensitive (2-5 kHz).
AAC (Advanced Audio Coding)
A lossy audio codec that superseded MP3. Default format for Apple and YouTube. Better quality than MP3 at equivalent bitrates.
ADSR (Attack, Decay, Sustain, Release)
The four stages of a sound's amplitude envelope. Attack is the rise time, Decay is the initial fall, Sustain is the held level, Release is the fade after the note ends.
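The four stages can be sketched as a piecewise-linear envelope; the following NumPy snippet is an illustrative sketch (function name and parameter values are arbitrary):

```python
import numpy as np

def adsr_envelope(attack, decay, sustain_level, release, note_length, sr=44100):
    """Piecewise-linear ADSR amplitude envelope; times are in seconds,
    note_length is the time from note-on to note-off."""
    a = np.linspace(0.0, 1.0, int(attack * sr), endpoint=False)           # rise to peak
    d = np.linspace(1.0, sustain_level, int(decay * sr), endpoint=False)  # fall to sustain
    s = np.full(int((note_length - attack - decay) * sr), sustain_level)  # held level
    r = np.linspace(sustain_level, 0.0, int(release * sr))                # fade after note-off
    return np.concatenate([a, d, s, r])

env = adsr_envelope(attack=0.01, decay=0.05, sustain_level=0.7,
                    release=0.2, note_length=0.5)
```

Multiplying a raw oscillator signal by such an envelope shapes each note's loudness over time.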
Aliasing
Distortion that occurs when a signal is sampled below the Nyquist rate. High frequencies fold back as phantom low-frequency content.
Attention (Self-Attention)
A mechanism that allows each element in a sequence to attend to (weight) all other elements. Fundamental to transformer architectures. See Attention Mathematics.
Autoregressive Model
A model that generates output sequentially, with each step conditioned on all previous steps: $p(x) = \prod_t p(x_t \mid x_{<t})$.
Batch Size
The number of training examples processed in one forward/backward pass. Larger batches provide more stable gradients but require more memory.
BPM (Beats Per Minute)
The tempo of a piece of music. 120 BPM means two beats per second.
Chromagram
A 12-dimensional representation of audio that collapses all octaves into pitch classes (C, C#, D, ..., B). Used for harmony analysis and melody conditioning.
CLAP (Contrastive Language-Audio Pretraining)
A model trained to align text and audio in a shared embedding space. Used for evaluation (CLAP score) and conditioning. See Text-Audio Alignment.
Classifier-Free Guidance (CFG)
A technique that amplifies the effect of conditioning during diffusion model inference by combining conditional and unconditional predictions: $\tilde{\epsilon} = \epsilon_\theta(x) + s \cdot (\epsilon_\theta(x, c) - \epsilon_\theta(x))$.
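The guidance step itself reduces to one line; a NumPy sketch of the combination (names are illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by the guidance scale s."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# s = 1.0 recovers the plain conditional prediction; larger values
# push the sample harder toward the prompt at the cost of diversity.
guided = cfg_combine(np.zeros(4), np.ones(4), guidance_scale=3.0)
```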
Codec
A system that encodes and decodes audio. Traditional codecs (MP3, Opus) use hand-designed algorithms; neural codecs (EnCodec, SoundStream) use learned networks.
Codebook
A dictionary of discrete vector codes used in vector quantization. Each audio frame is mapped to its nearest codebook entry.
Conditioning
The mechanism by which external information (text prompts, MIDI, reference audio) influences generation. Implemented via cross-attention, FiLM, or concatenation.
Contrastive Learning
A training paradigm that learns representations by pulling similar pairs together and pushing dissimilar pairs apart in embedding space. InfoNCE is the standard loss.
Cross-Attention
Attention where queries come from one sequence (e.g., audio) and keys/values from another (e.g., text). The primary mechanism for injecting text conditioning into audio models.
DAW (Digital Audio Workstation)
Software for recording, editing, mixing, and producing music. Examples: Ableton Live, Logic Pro, FL Studio, Pro Tools.
Diffusion Model
A generative model that learns to reverse a noise-adding process. Generates samples by iteratively denoising random noise. See Diffusion Models.
Discriminator
The critic network in a GAN that distinguishes real data from generated data. In audio, multi-scale and multi-period discriminators capture different temporal patterns.
Distillation (Knowledge Distillation)
Training a smaller model to reproduce the behavior of a larger model. Used to create faster, more efficient models.
ELBO (Evidence Lower Bound)
The training objective for VAEs: $\mathcal{L} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$.
Embedding
A dense vector representation of data (text, audio, images) in a continuous space. Similar items have similar embeddings.
EnCodec
Meta's neural audio codec using residual vector quantization. Compresses audio to discrete tokens at 1.5-24 kbps. Core component of MusicGen.
Epoch
One complete pass through the entire training dataset.
FAD (Fréchet Audio Distance)
The standard metric for evaluating generative audio quality. Measures distributional distance between real and generated audio embeddings. Lower is better.
Feature Matching Loss
A training loss that compares intermediate layer activations (features) of a discriminator, rather than just its final output. Reduces artifacts in GAN-generated audio.
FFT (Fast Fourier Transform)
An efficient algorithm for computing the Discrete Fourier Transform, converting time-domain signals to frequency-domain. See FFT.
FiLM (Feature-wise Linear Modulation)
A conditioning mechanism: $y = \gamma(c) \odot x + \beta(c)$. Modulates intermediate features using scale and shift parameters derived from the conditioning input.
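In code, FiLM is a per-channel affine transform; a minimal NumPy sketch (in a real model, gamma and beta would come from a small network over the conditioning input):

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel."""
    return gamma * x + beta      # broadcasts over time steps

x = np.ones((4, 8))              # (time, channels) intermediate features
gamma = np.full(8, 2.0)          # per-channel scale, derived from conditioning
beta = np.full(8, -1.0)          # per-channel shift, derived from conditioning
y = film(x, gamma, beta)         # each feature becomes 2 * 1 - 1 = 1
```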
Fine-Tuning
Continuing training of a pre-trained model on a new, typically smaller dataset. Specializes the model for a specific task or domain. See Fine-Tuning & Adaptation.
GAN (Generative Adversarial Network)
A framework with two networks, a generator and a discriminator, trained adversarially. In audio, primarily used for vocoders. See GAN Architectures.
Gradient Clipping
Limiting the magnitude of gradients during training to prevent exploding gradients and training instability.
Guidance Scale
The parameter in classifier-free guidance that controls the trade-off between prompt adherence and output diversity.
HiFi-GAN
A widely used GAN-based vocoder that converts mel spectrograms to high-fidelity waveforms. See GAN Architectures.
Hop Size
The stride (in samples) between successive STFT frames. Smaller hop size = finer time resolution but more frames.
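The frame count follows directly from the hop size; a small sketch of the unpadded relationship:

```python
def num_stft_frames(n_samples, win_length, hop_length):
    """Number of complete STFT frames, assuming no padding."""
    return 1 + (n_samples - win_length) // hop_length

# One second of audio at 44.1 kHz, 2048-sample window, 512-sample hop:
frames = num_stft_frames(44100, 2048, 512)   # halving the hop roughly doubles this
```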
InfoNCE
The contrastive learning loss function used in CLAP, MuLan, and similar models. Maximizes similarity of positive pairs relative to negative pairs.
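A simplified NumPy sketch of the symmetric form used by CLAP-style models (fixed temperature; row i of each batch is assumed to be a positive pair):

```python
import numpy as np

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text) embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # pairwise cosine similarities

    def ce_diag(m):
        # cross-entropy with the matching (diagonal) entry as the target
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average of audio->text and text->audio directions
    return (ce_diag(logits) + ce_diag(logits.T)) / 2
```

Perfectly aligned pairs give a near-zero loss; mismatched pairs are penalized heavily.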
Inpainting
Regenerating a specific region of audio while keeping the surrounding context. Natural in diffusion models.
KL Divergence (Kullback-Leibler Divergence)
A measure of how one probability distribution differs from another. Used as a regularizer in VAEs to keep the latent distribution close to a prior.
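For the diagonal-Gaussian case used in VAEs, the KL term against a standard normal prior has a closed form; a NumPy sketch summing over latent dimensions:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian:
    0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

kl_zero = kl_to_standard_normal(np.zeros(8), np.zeros(8))  # matches the prior exactly
```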
Latent Diffusion
Running diffusion in a compressed latent space rather than directly on audio. Faster and more memory-efficient. Used in Stable Audio.
Latent Space
A learned, compressed coordinate system where data is represented as dense vectors. Nearby points correspond to perceptually similar audio. See Latent Space Mapping.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that adds small, trainable low-rank matrices to frozen model weights. See Fine-Tuning & Adaptation.
Loss Function
A mathematical function that measures how well a model's predictions match the target. The model is trained to minimize this function. See Loss Functions.
LUFS (Loudness Units Full Scale)
A standardized loudness measurement (EBU R 128). Target levels: -14 LUFS for Spotify, -16 LUFS for Apple Music.
Mel Scale
A perceptual frequency scale where equal distances correspond to equal perceived pitch differences: $m = 2595 \log_{10}(1 + f/700)$.
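The mapping and its inverse in a few lines (HTK-style constants, matching the formula above):

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to mels: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below 1 kHz and logarithmic above it.
```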
Mel Spectrogram
A time-frequency representation of audio using mel-scaled frequency bands and log compression. The most common input format for audio ML models. See Mel Spectrograms.
MIDI (Musical Instrument Digital Interface)
A symbolic music representation encoding notes, velocities, and timing as discrete events. Does not contain audio, only performance data.
Mixed Precision
Training with both FP16 (or BF16) and FP32 operations to reduce memory usage and increase speed while maintaining accuracy.
MOS (Mean Opinion Score)
A subjective quality rating (1-5 scale) from human listeners. The gold standard for audio quality evaluation.
MuLan
Google's music-specific text-audio alignment model. Used in MusicLM for conditioning.
Multi-Head Attention
Attention computed in parallel across multiple heads, each with different learned projections. Different heads can capture different aspects of musical structure.
Neural Codec
A learned audio compression model using encoder-decoder networks with vector quantization. Examples: EnCodec, SoundStream, DAC. See Neural Audio Codecs.
Nyquist Frequency
Half the sample rate ($f_N = f_s / 2$). The maximum frequency that can be represented without aliasing.
Opus
State-of-the-art traditional audio codec. Excellent quality at low bitrates. Widely used in streaming and real-time communication.
Overfitting
When a model memorizes training data instead of learning general patterns. Recognized by low training loss but poor performance on new data.
PCM (Pulse Code Modulation)
Standard uncompressed digital audio format. Stores each sample as a fixed-point or floating-point number.
Perceptual Loss
A loss function that operates on intermediate features of a perceptual model rather than raw signal values. Produces more natural-sounding results than pixel/sample-level losses.
Quantization (Model)
Reducing the numerical precision of model weights (e.g., FP32 β†’ INT8) to decrease model size and increase inference speed.
Quantization (Audio)
The process of mapping continuous amplitude values to discrete levels. Determined by bit depth.
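A NumPy sketch of uniform quantization at a given bit depth (illustrative; a production path would also apply dither):

```python
import numpy as np

def quantize(x, bits):
    """Round samples in [-1, 1] to the nearest of 2**bits uniform levels."""
    scale = 2 ** (bits - 1)      # e.g. 32768 for 16-bit audio
    return np.clip(np.round(x * scale) / scale, -1.0, 1.0)

x = np.array([0.5, -0.250001, 0.3333])
err_8 = np.max(np.abs(quantize(x, 8) - x))    # coarse: audible quantization noise
err_16 = np.max(np.abs(quantize(x, 16) - x))  # CD bit depth: far smaller error
```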
Reparameterization Trick
A technique for backpropagating through a sampling operation: $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
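A NumPy sketch of the trick (in a real VAE, mu and log_var are encoder outputs, and gradients flow through them rather than through eps):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness is isolated
    in eps, so the sample is a differentiable function of mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

z = reparameterize(np.zeros(4), np.full(4, -20.0))  # tiny variance: z stays near mu
```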
Residual Vector Quantization (RVQ)
A cascaded quantization scheme where each codebook quantizes the residual from the previous codebook. Used in EnCodec, SoundStream, and DAC.
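A greedy NumPy sketch of RVQ encoding for a single vector (random codebooks stand in for learned ones):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual = x.copy()
    codes, reconstruction = [], np.zeros_like(x)
    for cb in codebooks:                                   # cb: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        codes.append(idx)                                  # one token per stage
        reconstruction += cb[idx]
        residual = residual - cb[idx]                      # pass the error onward
    return codes, reconstruction

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
codebooks = [0.5 * rng.standard_normal((256, 8)) for _ in range(4)]
codes, x_hat = rvq_encode(x, codebooks)
```

Each stage emits one token, so a 4-codebook RVQ yields 4 tokens per frame.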
RoPE (Rotary Position Embedding)
A positional encoding method that encodes relative positions through rotation of query and key vectors. Increasingly standard in transformers.
Sample Rate
The number of audio samples captured per second. 44,100 Hz (CD quality) means 44,100 samples per second.
SDR (Signal-to-Distortion Ratio)
A metric for source separation quality. Higher SDR means cleaner separation.
Softmax
A function that converts a vector of scores into a probability distribution: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.
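A numerically stable implementation subtracts the maximum score before exponentiating, which leaves the result unchanged but prevents overflow:

```python
import numpy as np

def softmax(x):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j), computed stably."""
    e = np.exp(x - np.max(x))    # shifting by max(x) cancels in the ratio
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))   # sums to 1; largest score gets largest probability
```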
SoundStream
Google's neural audio codec (2021). Pioneer of the RVQ architecture for audio compression. Used in AudioLM and MusicLM.
Spectrogram
A visual representation of the frequency content of audio over time. Computed via STFT.
STFT (Short-Time Fourier Transform)
The Fourier transform applied to overlapping windowed segments of a signal. Produces a time-frequency representation.
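A minimal NumPy sketch of the STFT (Hann window, no padding; names are illustrative):

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Window overlapping frames with a Hann window, then FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T   # shape: (n_fft // 2 + 1, n_frames)

sr = 8000
t = np.arange(sr) / sr
spec = np.abs(stft(np.sin(2 * np.pi * 1000 * t)))  # magnitude spectrogram of a 1 kHz tone
```

With these settings the bin spacing is 8000 / 1024 = 7.8125 Hz, so the tone's energy peaks at bin 128.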
Temperature
A parameter that controls the randomness of sampling. Higher temperature = more diverse but less predictable output. Lower temperature = more conservative but more predictable.
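A NumPy sketch of temperature sampling over a token distribution (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature before softmax, then sample.
    T -> 0 approaches greedy argmax; large T approaches uniform sampling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.1])
token = sample_with_temperature(logits, temperature=0.8)
```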
Token
A discrete unit in a sequence. In audio AI, tokens can be codec codes, quantized latent codes, or spectrogram patches.
Transformer
A neural network architecture based on self-attention. Dominant in language modeling and increasingly used for audio. See Transformers for Audio.
U-Net
An encoder-decoder architecture with skip connections. Standard backbone for diffusion model denoisers. See U-Net for Audio.
VAE (Variational Autoencoder)
A generative model with a probabilistic encoder and decoder, trained to maximize the evidence lower bound (ELBO). See VAEs for Audio.
Vector Quantization (VQ)
Mapping continuous vectors to the nearest entry in a learned codebook. Converts continuous representations to discrete tokens.
Vocoder
A model that converts intermediate representations (mel spectrograms, latent codes) into audio waveforms. Modern vocoders use GANs (HiFi-GAN, BigVGAN).
VQ-VAE (Vector Quantized VAE)
A VAE variant that uses discrete codebook quantization instead of continuous Gaussian latent variables. Foundation of Jukebox and neural audio codecs.
Waveform
The raw amplitude-over-time representation of audio. The most fundamental audio format.
Window Function
A function applied to signal frames before FFT to reduce spectral leakage. Common choices: Hann, Hamming, Blackman.