
Signal Processing Basics

Signal processing provides the mathematical tools that underpin every audio ML pipeline. This page covers the foundational concepts that connect raw audio to the representations that models consume.

Signals and Systems

A signal is a function that conveys information. Audio signals are typically real-valued functions of time:

$$x: \mathbb{R} \to \mathbb{R} \quad \text{(continuous)} \qquad x: \mathbb{Z} \to \mathbb{R} \quad \text{(discrete)}$$

A system transforms an input signal into an output signal:

$$y[n] = \mathcal{H}\{x[n]\}$$

Linear Time-Invariant (LTI) Systems

An LTI system satisfies:

  • Linearity: $\mathcal{H}\{ax_1 + bx_2\} = a\mathcal{H}\{x_1\} + b\mathcal{H}\{x_2\}$
  • Time-invariance: if $\mathcal{H}\{x[n]\} = y[n]$, then $\mathcal{H}\{x[n-k]\} = y[n-k]$

LTI systems are completely characterized by their impulse response $h[n]$.

Convolution

The output of an LTI system is the convolution of the input and the impulse response:

$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k] \, h[n-k]$$

Convolution Theorem

Convolution in the time domain equals multiplication in the frequency domain:

$$\mathcal{F}\{x * h\} = X(f) \cdot H(f)$$

This is why filtering is efficient via FFT:

  1. FFT both signals
  2. Multiply spectra
  3. Inverse FFT

Complexity: $O(N \log N)$ instead of $O(N^2)$ for direct convolution.
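The three steps above can be checked in a few lines of numpy — the FFT route reproduces direct convolution up to roundoff:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)   # input signal
h = rng.standard_normal(33)     # impulse response

# direct convolution: O(N^2) multiply-adds
y_direct = np.convolve(x, h)

# FFT convolution: zero-pad both to the full linear-convolution length,
# multiply spectra, inverse transform — O(N log N)
N = len(x) + len(h) - 1
y_fft = np.fft.irfft(np.fft.rfft(x, N) * np.fft.rfft(h, N), N)

print(np.max(np.abs(y_direct - y_fft)))  # tiny: identical up to roundoff
```

Note the zero-padding to length $N + K - 1$: without it, multiplying spectra computes *circular* convolution, which wraps the tail around.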

Convolution in Neural Networks

1D convolutions in audio CNNs operate similarly:

$$y[n] = \sum_{k=0}^{K-1} w[k] \, x[n \cdot s - k] + b$$

where $w$ is the learned kernel, $K$ is the kernel size, and $s$ is the stride. This is the building block of encoder-decoder architectures in neural codecs and vocoders.
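The sum above can be sketched in plain numpy. `conv1d` is a hypothetical helper written for this page; note that ML frameworks actually compute cross-correlation — the same sum with the kernel reversed — under the name "convolution":

```python
import numpy as np

def conv1d(x, w, b=0.0, stride=1):
    """Strided 1D sliding dot product, as in a CNN layer
    (cross-correlation form: the kernel is not flipped)."""
    K = len(w)
    out_len = (len(x) - K) // stride + 1
    return np.array([w @ x[i * stride : i * stride + K]
                     for i in range(out_len)]) + b

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(conv1d(x, w, stride=2))   # → [-2. -2. -2. -2.]
```

Reversing the kernel recovers true convolution: `conv1d(x, w[::-1])` matches `np.convolve(x, w, mode="valid")`.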

Filtering

Low-Pass Filter

Passes frequencies below the cutoff $f_c$, attenuates those above:

$$H(f) = \begin{cases} 1 & |f| \leq f_c \\ 0 & |f| > f_c \end{cases} \quad \text{(ideal)}$$

Practical filters have a transition band and ripple. Used for anti-aliasing before downsampling.

High-Pass Filter

Passes frequencies above the cutoff:

$$H_{\text{HP}}(f) = 1 - H_{\text{LP}}(f)$$

Used for DC removal and removing low-frequency rumble.

Band-Pass Filter

Passes frequencies in a range $[f_1, f_2]$:

$$H_{\text{BP}}(f) = \begin{cases} 1 & f_1 \leq |f| \leq f_2 \\ 0 & \text{otherwise} \end{cases}$$

Used in multi-band processing and critical band analysis.

Common Filter Designs

| Design | Characteristics | Use Case |
| --- | --- | --- |
| Butterworth | Maximally flat passband | General purpose |
| Chebyshev I | Steeper rolloff, passband ripple | Sharp cutoff needed |
| Chebyshev II | Steeper rolloff, stopband ripple | Less common |
| Elliptic | Steepest rolloff, both ripples | Minimum order |
| FIR (windowed) | Linear phase, no feedback | Phase-sensitive applications |
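As an illustration of the windowed-FIR design, here is a minimal windowed-sinc low-pass in plain numpy (the function name and parameter defaults are choices for this sketch, not a library API):

```python
import numpy as np

def fir_lowpass(fc, fs, numtaps=101):
    """Windowed-sinc FIR low-pass: ideal brick-wall impulse
    response tapered by a Hann window."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = (2 * fc / fs) * np.sinc(2 * fc / fs * n)  # ideal LPF impulse response
    h *= np.hanning(numtaps)                      # taper to suppress ripple
    return h / h.sum()                            # unity gain at DC

fs = 16000
h = fir_lowpass(4000, fs)                         # 4 kHz cutoff
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 7000 * t)
y = np.convolve(x, h, mode="same")                # 7 kHz tone is removed
```

After filtering, `y` is essentially the 1 kHz component alone: the 7 kHz tone sits deep in the stopband of the 101-tap filter.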

The Sampling Theorem

A bandlimited signal with maximum frequency $f_{\max}$ can be perfectly reconstructed from samples taken at rate $f_s > 2f_{\max}$:

$$x(t) = \sum_{n=-\infty}^{\infty} x[n] \, \operatorname{sinc}\!\left(\frac{t - nT_s}{T_s}\right)$$

where $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ and $T_s = 1/f_s$ is the sampling period.
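The reconstruction formula can be checked numerically by truncating the infinite sum to a finite block of samples (the constants here are arbitrary test values):

```python
import numpy as np

fs = 100.0                      # sample rate
Ts = 1.0 / fs
f0 = 13.0                       # tone well below Nyquist (fs/2 = 50)
n = np.arange(-500, 500)        # finite slice of the infinite sum
x = np.sin(2 * np.pi * f0 * n * Ts)

def sinc_reconstruct(t):
    # x(t) = sum_n x[n] sinc((t - n*Ts)/Ts); np.sinc is sin(pi x)/(pi x)
    return np.sum(x * np.sinc((t - n * Ts) / Ts))

t = 0.0137                      # an off-grid time
err = abs(sinc_reconstruct(t) - np.sin(2 * np.pi * f0 * t))
print(err)                      # small; due only to truncating the sum
```

The residual error shrinks as more samples are included, since the sinc tails decay like $1/|n|$.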

Aliasing

If $f_s < 2f_{\max}$, high frequencies fold back into lower frequencies:

$$f_{\text{alias}} = |f - k \cdot f_s|$$

where $k$ is the integer that brings the result into $[0, f_s/2]$.

Aliasing creates phantom tones and is irreversible. Anti-aliasing filters must be applied before downsampling.
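A small demonstration of the fold: a 5 kHz tone sampled at 8 kHz shows up as a phantom tone at $|5000 - 8000| = 3000$ Hz:

```python
import numpy as np

fs = 8000
f_true = 5000                       # above Nyquist (fs/2 = 4000 Hz)
n = np.arange(fs)                   # one second of samples
x = np.sin(2 * np.pi * f_true * n / fs)

# locate the spectral peak of the sampled signal
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peak = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
print(peak)                         # 3000.0 — the aliased frequency
```

No amount of post-hoc filtering can tell this 3 kHz alias apart from a genuine 3 kHz tone, which is why the anti-aliasing filter must come first.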

Resampling

Converting between sample rates is a common operation in audio ML pipelines.

Upsampling by Factor $L$

  1. Insert $L-1$ zeros between consecutive samples (raising the rate to $L f_s$)
  2. Apply a low-pass filter with cutoff $f_s/2$ — i.e. $(L f_s)/(2L)$ at the new rate — to remove the spectral images

Downsampling by Factor $M$

  1. Apply an anti-aliasing low-pass filter with cutoff $f_s/(2M)$
  2. Keep every $M$-th sample

Arbitrary Rate Conversion

Combine upsampling by $L$ and downsampling by $M$:

$$f_{s,\text{new}} = f_s \cdot \frac{L}{M}$$

Polyphase filter implementations are efficient for large rate changes.
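Assuming SciPy is available, `scipy.signal.resample_poly` implements this polyphase $L/M$ scheme directly — here a 48 kHz → 16 kHz conversion, a common preprocessing step for speech models:

```python
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 48000, 16000
t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone
y = resample_poly(x, up=1, down=3)  # anti-alias filter + decimate

print(len(y))                       # 16000 — still one second of audio
```

The tone survives unchanged at 440 Hz in the output spectrum; only content above the new Nyquist frequency (8 kHz) is removed.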

Windowing

Multiplying a signal by a window function selects a segment and controls spectral leakage:

$$x_w[n] = x[n] \cdot w[n]$$

Common Windows

| Window | Sidelobe Level | Main Lobe Width | Use |
| --- | --- | --- | --- |
| Rectangular | -13 dB | Narrowest | Analysis (no windowing) |
| Hann | -31 dB | Moderate | General STFT |
| Hamming | -43 dB | Moderate | Speech processing |
| Blackman | -58 dB | Wide | High dynamic range |
| Kaiser | Adjustable | Adjustable | Flexible |

Hann window is the default choice for most audio ML STFT computations.

Overlap-Add (OLA)

For perfect reconstruction in STFT processing, the shifted windows must sum to a constant:

$$\sum_{m} w[n - mH] = \text{constant}$$

The Hann window satisfies this constraint at 50% overlap ($H = N/2$). When the same window is applied at both analysis and synthesis, the condition applies to $w^2$ instead — which is why $\sqrt{\text{Hann}}$ windows are common in that setting.
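The constant-overlap-add (COLA) property of the Hann window at hop $N/2$ is easy to verify numerically:

```python
import numpy as np

N, H = 1024, 512                                  # Hann window, 50% overlap
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)         # periodic Hann

# overlap-add shifted copies of the window and inspect the interior
acc = np.zeros(8 * N)
for m in range((len(acc) - N) // H + 1):
    acc[m * H : m * H + N] += w
interior = acc[N:-N]                              # skip the ramp-up/down edges
print(interior.min(), interior.max())             # both 1.0: COLA holds
```

The *periodic* Hann window is used here (no endpoint correction); the symmetric variant `np.hanning(N)` overlap-adds to a constant only approximately.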

Z-Transform and Transfer Functions

The Z-transform maps a discrete sequence to a function of the complex variable $z$:

$$X(z) = \sum_{n=-\infty}^{\infty} x[n] z^{-n}$$

A digital filter's transfer function:

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{\sum_{k=0}^{N} a_k z^{-k}}$$

  • FIR filters: $A(z) = 1$ (no feedback, always stable)
  • IIR filters: non-trivial $A(z)$ (feedback, may be unstable)
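A tiny worked example, using a one-pole IIR smoother as the filter (an illustrative choice, not from the text above): evaluating $H(z)$ on the unit circle $z = e^{j\omega}$ gives the frequency response.

```python
import numpy as np

# one-pole IIR low-pass: y[n] = (1 - a) x[n] + a y[n-1]
# H(z) = (1 - a) / (1 - a z^{-1}); the pole at z = a lies inside the
# unit circle for |a| < 1, so the filter is stable
a = 0.9
b = np.array([1 - a])        # numerator coefficients B(z)
A = np.array([1.0, -a])      # denominator coefficients A(z)

def H(z):
    """Evaluate H(z) = B(z)/A(z) at a complex point z."""
    num = np.sum(b * z ** -np.arange(len(b)))
    den = np.sum(A * z ** -np.arange(len(A)))
    return num / den

print(abs(H(np.exp(1j * 0))))       # ≈ 1.0   — unity gain at DC
print(abs(H(np.exp(1j * np.pi))))   # ≈ 0.053 — attenuated at Nyquist
```

Moving the pole closer to the unit circle ($a \to 1$) narrows the passband; moving it outside would make the filter unstable.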

Correlation and Autocorrelation

Cross-correlation measures the similarity between two signals as a function of lag:

$$R_{xy}[\tau] = \sum_{n} x[n] \, y[n + \tau]$$

Autocorrelation ($y = x$) reveals periodicity:

$$R_{xx}[\tau] = \sum_{n} x[n] \, x[n + \tau]$$

Autocorrelation peaks indicate:

  • Pitch period (strongest peak location)
  • Rhythmic period (for onset/energy envelopes)
  • Repetitive structure (useful for music structure analysis)
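A minimal autocorrelation pitch estimator sketch (the 80–500 Hz search range is an assumption chosen to cover typical speech pitch):

```python
import numpy as np

fs = 16000
f0 = 200.0                                   # true pitch
t = np.arange(2048) / fs
# harmonic test signal: fundamental plus one overtone
x = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

r = np.correlate(x, x, mode="full")[len(x) - 1 :]   # R_xx[tau], tau >= 0

# strongest peak in the plausible pitch range gives the period
lo, hi = int(fs / 500), int(fs / 80)
tau = lo + np.argmax(r[lo:hi])
print(fs / tau)                              # ≈ 200.0 Hz
```

Restricting the lag range matters: without it, the trivial maximum at $\tau = 0$ (and, for very long lags, octave errors) would dominate.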

Decibels

The decibel scale is ubiquitous in audio.

Power ratio:

$$L_{\text{dB}} = 10 \log_{10} \frac{P}{P_{\text{ref}}}$$

Amplitude ratio:

$$L_{\text{dB}} = 20 \log_{10} \frac{A}{A_{\text{ref}}}$$

Common reference levels:

  • dBFS (Full Scale): $A_{\text{ref}} = 1.0$ in digital audio
  • dB SPL: $P_{\text{ref}} = 20\,\mu\text{Pa}$ (threshold of hearing)
  • LUFS: integrated loudness (EBU R 128 standard)
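A small helper illustrating the amplitude formula (the function name and the -120 dB silence floor are choices for this sketch, not a standard):

```python
import math

def amplitude_to_db(a, a_ref=1.0, floor=-120.0):
    """20 * log10(|a| / a_ref); with a_ref = 1.0 the result is in dBFS.
    Clamped to a floor so silence does not produce -inf."""
    if a == 0:
        return floor
    return max(20.0 * math.log10(abs(a) / a_ref), floor)

print(amplitude_to_db(1.0))    # 0.0 (full scale)
print(amplitude_to_db(0.5))    # ≈ -6.02 (halving amplitude ≈ -6 dB)
print(amplitude_to_db(0.0))    # -120.0 (clamped floor)
```

The familiar rules of thumb fall out directly: halving amplitude is about -6 dB, halving power about -3 dB.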

Signal Processing in ML Pipelines

| Stage | Signal Processing Operation |
| --- | --- |
| Input normalization | Gain adjustment, DC removal |
| Resampling | Sample rate conversion |
| Feature extraction | STFT, mel filterbank, log compression |
| Augmentation | Filtering, noise addition, time-stretching |
| Output | Vocoder (mel → waveform), loudness normalization |
| Evaluation | Spectral distance, SNR, correlation metrics |

Understanding these fundamentals helps debug audio ML pipelines β€” many "model" problems are actually signal processing problems (wrong sample rate, missing anti-aliasing, incorrect windowing, etc.).