Signal Processing Basics
Signal processing provides the mathematical tools that underpin every audio ML pipeline. This page covers the foundational concepts that connect raw audio to the representations that models consume.
Signals and Systems
A signal is a function that conveys information. Audio signals are typically real-valued functions of time:
$$x : \mathbb{R} \to \mathbb{R}, \quad t \mapsto x(t)$$
A system $T$ transforms an input signal $x$ into an output signal $y$:
$$y(t) = T\{x(t)\}$$
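Linear Time-Invariant (LTI) Systems
An LTI system $T\{\cdot\}$ satisfies:
- Linearity: $T\{a\,x_1(t) + b\,x_2(t)\} = a\,T\{x_1(t)\} + b\,T\{x_2(t)\}$
- Time-invariance: if $y(t) = T\{x(t)\}$, then $T\{x(t - \tau)\} = y(t - \tau)$
LTI systems are completely characterized by their impulse response $h(t)$.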
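The two LTI properties can be checked numerically. The following is a minimal sketch, assuming a hypothetical example system (a 3-point moving average) that is not taken from the text; it tests linearity and time-invariance on random inputs and reads off the impulse response.

```python
import numpy as np

def system(x):
    """Hypothetical example system: 3-point moving average (an FIR filter)."""
    return np.convolve(x, np.ones(3) / 3)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
x2 = rng.standard_normal(50)
a, b = 2.0, -0.5

# Linearity: T{a*x1 + b*x2} equals a*T{x1} + b*T{x2}.
linear = np.allclose(system(a * x1 + b * x2), a * system(x1) + b * system(x2))

# Time-invariance: delaying the input by d samples delays the output by d samples.
d = 5
delayed_in = np.concatenate([np.zeros(d), x1])
delayed_out = np.concatenate([np.zeros(d), system(x1)])
time_invariant = np.allclose(system(delayed_in), delayed_out)

# Impulse response: feeding a unit impulse recovers the filter kernel.
impulse = np.zeros(5)
impulse[0] = 1.0
h = system(impulse)[:3]
```

A system that fails either check (for example, one that clips its input) cannot be described by an impulse response alone.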
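Convolution
The output of an LTI system is the convolution of the input $x$ and the impulse response $h$:
$$y(t) = (x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau$$
For discrete-time signals the integral becomes a sum: $y[n] = \sum_{k} x[k]\, h[n - k]$.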
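As an illustration, the discrete form of convolution can be written out directly. This is a sketch for finite-length sequences; the function name `convolve_direct` and the first-difference impulse response are illustrative choices, not part of the original text.

```python
import numpy as np

def convolve_direct(x, h):
    """Direct discrete convolution y[n] = sum_k x[k] * h[n - k] (full output)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    y = np.zeros(len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        # Each input sample launches a scaled, shifted copy of the impulse response.
        y[k:k + len(h)] += xk * h
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([1.0, -1.0])   # hypothetical first-difference impulse response
y = convolve_direct(x, h)   # [1., 1., 1., -3.]
```

The result should agree with `np.convolve(x, h)`, which computes the same full convolution.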
Convolution Theorem
Convolution in the time domain equals multiplication in the frequency domain:
$$\mathcal{F}\{x * h\} = X(f) \cdot H(f)$$
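This is why filtering is efficient via FFT:
- FFT both signals (zero-padded to at least the length of the full linear convolution)
- Multiply spectra
- Inverse FFT
Complexity: $O(N \log N)$ instead of $O(N^2)$ for direct convolution of two length-$N$ signals.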
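The three FFT steps can be sketched with NumPy's real FFT. The zero-padding length is the important detail: without it, the product of spectra yields a circular rather than a linear convolution. The helper name `fft_convolve` and the signal sizes are assumptions made for illustration.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via FFT: transform, multiply spectra, inverse transform."""
    n_out = len(x) + len(h) - 1
    # Pad to the next power of two >= n_out so circular wrap-around cannot occur.
    n_fft = 1 << (n_out - 1).bit_length()
    X = np.fft.rfft(x, n_fft)
    H = np.fft.rfft(h, n_fft)
    y = np.fft.irfft(X * H, n_fft)
    return y[:n_out]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)   # hypothetical input signal
h = rng.standard_normal(64)     # hypothetical impulse response
y_fft = fft_convolve(x, h)
y_direct = np.convolve(x, h)    # direct reference
```

Both results should match to within floating-point rounding error.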
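Convolution in Neural Networks
1D convolutions in audio CNNs operate similarly:
$$y[n] = \sum_{k=0}^{K-1} w[k]\, x[n s + k]$$
where $w$ is the learned kernel, $K$ is the kernel size, and $s$ is the stride. Unlike textbook convolution, the kernel is not flipped (strictly a cross-correlation), which makes no difference for learned weights. This is the building block of encoder-decoder architectures in neural codecs and vocoders.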
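A single-channel strided 1D convolution can be sketched in a few lines. This assumes the common deep-learning convention (no kernel flip, no padding, no bias); the function name `conv1d` and the example kernel are hypothetical.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """y[n] = sum_{k=0}^{K-1} w[k] * x[n*stride + k], with no padding and no bias."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    K = len(w)
    n_out = (len(x) - K) // stride + 1
    return np.array([np.dot(w, x[n * stride:n * stride + K]) for n in range(n_out)])

x = np.arange(8, dtype=float)     # 0, 1, ..., 7
w = np.array([1.0, 0.0, -1.0])    # hypothetical "learned" kernel
y = conv1d(x, w, stride=2)        # [-2., -2., -2.]
```

A stride of 2 halves the time resolution, which is how codec encoders progressively downsample a waveform.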
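Filtering
Low-Pass Filter
Passes frequencies below the cutoff $f_c$ and attenuates those above. The ideal response is:
$$H(f) = \begin{cases} 1, & |f| \le f_c \\ 0, & |f| > f_c \end{cases}$$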
Practical filters have a transition band and ripple. Used for anti-aliasing before downsampling.
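A practical low-pass filter can be sketched with the windowed-sinc method: truncate the ideal impulse response and taper it with a window. The tap count, cutoff, and test tones below are illustrative assumptions, not values from the text.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Windowed-sinc low-pass FIR (Hamming window) with unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / fs                    # cutoff in cycles per sample
    h = 2 * fc * np.sinc(2 * fc * n)       # truncated ideal low-pass response
    h *= np.hamming(num_taps)              # taper to reduce ripple
    return h / h.sum()                     # normalize DC gain to 1

fs = 16000
h = lowpass_fir(cutoff_hz=2000, fs=fs)
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 500 * t)      # tone in the passband
high = np.sin(2 * np.pi * 6000 * t)    # tone in the stopband
low_out = np.convolve(low, h, mode="same")
high_out = np.convolve(high, h, mode="same")
```

The 500 Hz tone should pass nearly unchanged while the 6 kHz tone is strongly attenuated. More taps narrow the transition band at the cost of latency and computation.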
High-Pass Filter
Passes frequencies above the cutoff $f_c$:
$$H(f) = \begin{cases} 0, & |f| < f_c \\ 1, & |f| \ge f_c \end{cases}$$
Used for DC removal and removing low-frequency rumble.
Band-Pass Filter
Passes frequencies in a range $[f_1, f_2]$:
$$H(f) = \begin{cases} 1, & f_1 \le |f| \le f_2 \\ 0, & \text{otherwise} \end{cases}$$
Used in multi-band processing and critical band analysis.
Common Filter Designs
| Design | Characteristics | Use Case |
|---|---|---|
| Butterworth | Maximally flat passband | General purpose |
| Chebyshev I | Steeper rolloff, passband ripple | Sharp cutoff needed |
| Chebyshev II | Steeper rolloff, stopband ripple | Less common |
| Elliptic | Steepest rolloff, both ripples | Minimum order |
| FIR (windowed) | Linear phase, no feedback | Phase-sensitive applications |
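The Sampling Theorem
A bandlimited signal with maximum frequency $f_{\max}$ can be perfectly reconstructed from samples taken at rate $f_s > 2 f_{\max}$:
$$x(t) = \sum_{n=-\infty}^{\infty} x[n]\, \operatorname{sinc}\!\left(\frac{t - n T_s}{T_s}\right)$$
where $T_s = 1/f_s$ is the sampling period and $\operatorname{sinc}(u) = \sin(\pi u) / (\pi u)$.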
Aliasing
If $f_s < 2 f_{\max}$, high frequencies fold back into lower frequencies:
$$f_{\text{alias}} = |f - k f_s|, \quad k = \operatorname{round}(f / f_s)$$
Aliasing creates phantom tones and is irreversible. Anti-aliasing filters must be applied before downsampling.
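Aliasing is easy to reproduce numerically. In this sketch (the sample rate and tone frequency are hypothetical), a 7 kHz tone sampled at 10 kHz shows up in the spectrum at 3 kHz, exactly where the folding rule predicts.

```python
import numpy as np

fs = 10_000          # sampling rate in Hz, so the Nyquist frequency is 5 kHz
f_tone = 7_000       # tone above Nyquist
n = np.arange(fs)    # one second of samples gives 1 Hz bin spacing
x = np.sin(2 * np.pi * f_tone * n / fs)

# Locate the strongest component in the sampled signal.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
f_observed = freqs[np.argmax(spectrum)]            # 3000.0

# Folding rule: distance to the nearest integer multiple of fs.
f_alias = abs(f_tone - round(f_tone / fs) * fs)    # 3000
```

Once sampled, the 7 kHz tone is indistinguishable from a genuine 3 kHz tone, which is why the filtering has to happen before the rate reduction.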
Resampling
Converting between sample rates is a common operation in audio ML pipelines.
Upsampling by Factor $L$
- Insert $L - 1$ zeros between consecutive samples
- Apply a low-pass filter with cutoff $\pi / L$ (the original Nyquist frequency) to remove spectral images
Downsampling by Factor $M$
- Apply an anti-aliasing low-pass filter with cutoff $\pi / M$ (the new Nyquist frequency)
- Keep every $M$-th sample
Arbitrary Rate Conversion
Combine upsampling by $L$ and downsampling by $M$:
$$f_{\text{out}} = f_{\text{in}} \cdot \frac{L}{M}$$
Polyphase filter implementations are efficient for large rate changes.
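The upsample, filter, downsample chain can be sketched naively as follows. This is not a polyphase implementation (it wastes work filtering the inserted zeros) and is meant only to make the steps concrete. The function name, tap count, and sample rates are assumptions; for real pipelines a library routine such as `scipy.signal.resample_poly` is the more usual choice.

```python
import numpy as np

def resample_rational(x, up, down, taps_per_phase=32):
    """Naive L/M resampler: zero-stuff by `up`, low-pass, keep every `down`-th sample."""
    x = np.asarray(x, dtype=float)
    # 1. Insert up-1 zeros between consecutive samples.
    stuffed = np.zeros(len(x) * up)
    stuffed[::up] = x
    # 2. A single low-pass acts as both anti-imaging and anti-aliasing filter.
    fc = 0.5 / max(up, down)                     # cycles/sample at the upsampled rate
    num_taps = 2 * taps_per_phase * max(up, down) + 1
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    h *= up / h.sum()                            # gain `up` compensates for the zeros
    filtered = np.convolve(stuffed, h, mode="same")
    # 3. Keep every down-th sample.
    return filtered[::down]

fs_in, fs_out = 24_000, 16_000    # hypothetical rates: ratio 2/3
up, down = 2, 3
t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
y = resample_rational(x, up, down)
```

Away from the edges, `y` should closely match a 440 Hz tone sampled directly at 16 kHz.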
Windowing
Multiplying a signal by a window function selects a segment and controls spectral leakage:
$$x_w[n] = x[n] \cdot w[n]$$
Common Windows
| Window | Sidelobe Level | Main Lobe Width | Use |
|---|---|---|---|
| Rectangular | -13 dB | Narrowest | Analysis (no windowing) |
| Hann | -31 dB | Moderate | General STFT |
| Hamming | -43 dB | Moderate | Speech processing |
| Blackman | -58 dB | Wide | High dynamic range |
| Kaiser | Adjustable | Adjustable | Flexible |
The Hann window is the default choice for most audio ML STFT computations.
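The effect of windowing on spectral leakage can be measured directly. In this sketch the frame length, the tone frequency (chosen to fall between two FFT bins, the worst case for leakage), and the bin range used to measure leakage are all illustrative assumptions.

```python
import numpy as np

N = 1024
n = np.arange(N)
x = np.sin(2 * np.pi * 100.5 * n / N)   # halfway between FFT bins 100 and 101

def leakage_db(window):
    """Largest magnitude far from the tone (bins 300 and up), relative to the peak, in dB."""
    mag = np.abs(np.fft.rfft(x * window))
    return 20 * np.log10(mag[300:].max() / mag.max())

rect_db = leakage_db(np.ones(N))       # rectangular window, i.e. no windowing
hann_db = leakage_db(np.hanning(N))    # Hann window
```

The Hann window's sidelobes fall off much faster with distance from the tone, so `hann_db` should come out far below `rect_db`. The price is a wider main lobe.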
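Overlap-Add (OLA)
For perfect reconstruction in STFT processing, the shifted windows must sum to a constant (the constant overlap-add, or COLA, constraint):
$$\sum_{m} w[n - mH] = C \quad \text{for all } n$$
where $H$ is the hop size. The Hann window satisfies this constraint with 50% overlap ($H = N/2$ for window length $N$).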
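The overlap-add constraint can be verified by summing shifted copies of the window. This sketch assumes a periodic Hann window (the symmetric `np.hanning` window of length N + 1 with its last sample dropped); the window length and frame count are arbitrary.

```python
import numpy as np

N = 1024          # window length (hypothetical)
hop = N // 2      # 50% overlap
# Periodic Hann: drop the last sample of the symmetric length-(N + 1) window.
w = np.hanning(N + 1)[:-1]

num_frames = 8
ola = np.zeros(hop * (num_frames - 1) + N)
for m in range(num_frames):
    ola[m * hop:m * hop + N] += w   # accumulate the window shifted by m hops

# Skip the first and last half-window, where only one frame contributes.
steady = ola[hop:-hop]
```

In the steady-state region the sum should be constant at 1. The symmetric window `np.hanning(N)` misses this slightly because its endpoints are both zero.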
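Z-Transform and Transfer Functions
The Z-transform maps a discrete sequence to a function of a complex variable $z$ (a power series in $z^{-1}$):
$$X(z) = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n}$$
A digital filter's transfer function:
$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}}$$
- FIR filters: $A(z) = 1$ (no feedback, always stable)
- IIR filters: $A(z)$ non-trivial (feedback, may be unstable; a causal filter is stable only if all poles lie inside the unit circle)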
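Stability of a causal filter can be checked from the roots of the denominator polynomial. This is a minimal sketch; the helper name `is_stable` and the example coefficients are hypothetical.

```python
import numpy as np

def is_stable(a):
    """Check stability of a causal filter with denominator
    A(z) = a[0] + a[1] z^-1 + ... + a[N] z^-N.

    The poles are the roots of a[0] z^N + a[1] z^(N-1) + ... + a[N];
    the filter is stable when all of them lie strictly inside the unit circle.
    """
    poles = np.roots(a)
    return bool(np.all(np.abs(poles) < 1.0))

fir_stable = is_stable([1.0])              # FIR: A(z) = 1, no poles to check
iir_stable = is_stable([1.0, -0.9])        # one pole at z = 0.9
iir_unstable = is_stable([1.0, -1.1])      # one pole at z = 1.1
```

The first two calls should report a stable filter and the third an unstable one, since its impulse response grows without bound.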
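Correlation and Autocorrelation
Cross-correlation measures similarity between two signals as a function of lag $\tau$:
$$R_{xy}[\tau] = \sum_{n} x[n]\, y[n + \tau]$$
Autocorrelation ($y = x$) reveals periodicity:
$$R_{xx}[\tau] = \sum_{n} x[n]\, x[n + \tau]$$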
Autocorrelation peaks indicate:
- Pitch period (location of the strongest peak at nonzero lag; the peak at lag 0 is always the largest)
- Rhythmic period (for onset/energy envelopes)
- Repetitive structure (useful for music structure analysis)
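A basic autocorrelation pitch estimator illustrates the first item. This is a sketch only: the search range, sample rate, and synthetic two-harmonic test signal are assumptions, and practical pitch trackers add normalization and peak interpolation on top of this idea.

```python
import numpy as np

def estimate_pitch(x, fs, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency from the strongest autocorrelation peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Full autocorrelation; keep non-negative lags only (index 0 is lag 0).
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Restrict the search to lags matching plausible pitch periods,
    # which also excludes the trivial maximum at lag 0.
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag

fs = 16_000
t = np.arange(1600) / fs    # 0.1 s of signal
# Synthetic tone: 200 Hz fundamental plus a weaker second harmonic.
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
f0 = estimate_pitch(x, fs)
```

The estimate should land close to 200 Hz, corresponding to a period of 80 samples at 16 kHz.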
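Decibels
The decibel scale is ubiquitous in audio:
Power ratio: $L_{\text{dB}} = 10 \log_{10}(P / P_0)$
Amplitude ratio: $L_{\text{dB}} = 20 \log_{10}(A / A_0)$
Common reference levels:
- dBFS (Full Scale): the reference is the maximum representable amplitude in digital audio, so $0$ dBFS is full scale
- dBSPL: $p_0 = 20\,\mu\text{Pa}$ (threshold of hearing)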
- LUFS: integrated loudness (EBU R 128 standard)
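The two decibel formulas translate directly into code. This sketch assumes floating-point audio normalized so that full scale is an amplitude of 1.0; the function names are illustrative.

```python
import numpy as np

def power_to_db(p, p_ref=1.0):
    """Power ratio in dB: 10 * log10(P / P0)."""
    return 10.0 * np.log10(p / p_ref)

def amplitude_to_db(a, a_ref=1.0):
    """Amplitude ratio in dB: 20 * log10(A / A0)."""
    return 20.0 * np.log10(a / a_ref)

def peak_dbfs(x):
    """Peak level relative to full scale, assuming full scale is |x| = 1.0."""
    return amplitude_to_db(np.max(np.abs(x)))

half_amplitude_db = amplitude_to_db(0.5)   # about -6.02 dB
half_power_db = power_to_db(0.5)           # about -3.01 dB
peak = peak_dbfs(np.array([0.1, -0.25, 0.05]))
```

Halving the amplitude costs about 6 dB while halving the power costs about 3 dB, because power is proportional to amplitude squared.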
Signal Processing in ML Pipelines
| Stage | Signal Processing Operation |
|---|---|
| Input normalization | Gain adjustment, DC removal |
| Resampling | Sample rate conversion |
| Feature extraction | STFT, mel filterbank, log compression |
| Augmentation | Filtering, noise addition, time-stretching |
| Output | Vocoder (mel → waveform), loudness normalization |
| Evaluation | Spectral distance, SNR, correlation metrics |
Understanding these fundamentals helps debug audio ML pipelines: many "model" problems are actually signal processing problems (wrong sample rate, missing anti-aliasing, incorrect windowing, etc.).