
Real-Time Audio Inference

Deploying AI music models in real-time or near-real-time applications requires careful optimization of latency, throughput, and computational efficiency. This page covers the engineering strategies for fast audio inference.

Latency Requirements

Different applications have different latency budgets:

| Application | Max Latency | Challenge Level |
| --- | --- | --- |
| Live performance | <10 ms | Extreme |
| Interactive music tools | <100 ms | Very hard |
| Real-time voice conversion | <50 ms | Hard |
| Streaming generation | <1 second | Moderate |
| Batch generation | Minutes | Easy |
| Offline processing | Hours | Trivial |

$$\text{Total latency} = T_{\text{input}} + T_{\text{compute}} + T_{\text{output}} + T_{\text{network}}$$

Model Optimization Techniques

Quantization

Reduce numerical precision of model weights and activations:

Post-Training Quantization (PTQ):

$$w_q = \text{round}\left(\frac{w}{\Delta}\right) \cdot \Delta, \quad \Delta = \frac{w_{\max} - w_{\min}}{2^B - 1}$$

| Precision | Memory | Speed | Quality |
| --- | --- | --- | --- |
| FP32 | 4 bytes/param | Baseline | Reference |
| FP16 | 2 bytes/param | ~2× faster | Near-identical |
| INT8 | 1 byte/param | ~3–4× faster | Slight degradation |
| INT4 | 0.5 bytes/param | ~5–6× faster | Noticeable degradation |
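
The PTQ rule above can be tried on a handful of numbers. The following is a minimal sketch in plain Python, not a production quantizer: the helper names and the toy weights are assumptions made for illustration, and the code follows the formula literally (no zero-point and no clamping, which real toolkits usually add).

```python
def quantize_uniform(weights, bits):
    """Snap each weight to the grid round(w / delta) * delta from the PTQ rule."""
    w_min, w_max = min(weights), max(weights)
    if w_max == w_min:
        return list(weights), 0.0  # constant tensor: nothing to quantize
    delta = (w_max - w_min) / (2 ** bits - 1)  # step size for B bits
    return [round(w / delta) * delta for w in weights], delta


def max_abs_error(original, quantized):
    """Largest per-weight rounding error."""
    return max(abs(a - b) for a, b in zip(original, quantized))


weights = [-1.0, -0.42, 0.0, 0.31, 1.0]  # toy values, not from a real model
q8, delta8 = quantize_uniform(weights, bits=8)
q4, delta4 = quantize_uniform(weights, bits=4)

print(f"INT8: step={delta8:.5f}, max error={max_abs_error(weights, q8):.5f}")
print(f"INT4: step={delta4:.5f}, max error={max_abs_error(weights, q4):.5f}")
```

The rounding error of each weight is bounded by half the step size, so dropping from 8 to 4 bits widens the step (and the worst-case error) by a factor of 17 for the same weight range.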

Quantization-Aware Training (QAT): simulate quantization during training, resulting in better quality at low precision.

For audio models, INT8 quantization typically preserves quality. INT4 may introduce audible artifacts in vocoders.

Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher":

$$\mathcal{L}_{\text{distill}} = \alpha \mathcal{L}_{\text{task}} + (1-\alpha) \mathcal{L}_{\text{KD}}$$

$$\mathcal{L}_{\text{KD}} = D_{\text{KL}}(p_{\text{student}} \| p_{\text{teacher}})$$

Distillation can reduce model size by 2–10× while retaining most quality.
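
As a rough illustration of the combined loss, the sketch below evaluates it for a single classification-style output in plain Python. The temperature parameter, the cross-entropy task loss, and all function names are assumptions that do not appear on this page. The KL term is written in the argument order given in the formula above; many implementations use the opposite order, with the teacher distribution first.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature (assumed here)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, target_index,
                 alpha=0.5, temperature=2.0):
    """alpha * L_task + (1 - alpha) * L_KD for one example."""
    # Task loss: cross-entropy of the student against the ground-truth label.
    task_loss = -math.log(softmax(student_logits)[target_index])
    # KD loss: D_KL(p_student || p_teacher) on temperature-softened outputs.
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    kd_loss = kl_divergence(p_student, p_teacher)
    return alpha * task_loss + (1 - alpha) * kd_loss


matched = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], target_index=2, alpha=0.0)
mismatched = distill_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0], target_index=2, alpha=0.0)
print(matched, mismatched)
```

With `alpha=0.0` only the KD term remains, so a student that reproduces the teacher's logits has zero loss and any disagreement makes it positive.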

Pruning

Remove unnecessary weights:

  • Structured pruning: remove entire channels, heads, or layers
  • Unstructured pruning: zero out individual weights by magnitude

$$w'_i = \begin{cases} w_i & \text{if } |w_i| > \theta \\ 0 & \text{otherwise} \end{cases}$$

Typical audio models can be pruned 30–50% with minimal quality loss.
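
The unstructured rule can be sketched by choosing the threshold from a target sparsity. This is a minimal example with a hypothetical helper name and toy weights; real pruning pipelines usually prune per layer and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of the weights.

    Returns the pruned weights and the threshold theta that was used.
    Ties at the threshold are also zeroed, so slightly more than the
    requested fraction may be removed.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights), 0.0
    theta = sorted(abs(w) for w in weights)[n_prune - 1]
    # Keep w_i only if |w_i| > theta, as in the rule above.
    return [w if abs(w) > theta else 0.0 for w in weights], theta


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.3]  # toy values
pruned, theta = magnitude_prune(weights, sparsity=0.4)
print(theta, pruned)
```

Here 40% sparsity zeroes the four smallest-magnitude entries and leaves the remaining six untouched.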

Architecture-Level Optimizations

  • Reduce attention layers: attention is the bottleneck for long sequences
  • Use efficient attention: Flash Attention, linear attention
  • Reduce model depth: fewer layers with wider channels
  • Cache key-value pairs: for autoregressive models, cache KV states

KV-Cache for Autoregressive Models

In autoregressive generation, caching previous key-value pairs avoids recomputation:

  • Without cache: each token requires attending to all previous tokens from scratch
  • With cache: only compute attention for the new token against cached KV pairs

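$$\text{Speedup} \approx \frac{T}{1} = T$$

per step when generating the $T$-th token, because the uncached forward pass reprocesses all $T$ positions while the cached pass processes only the newest one. KV-caching is essential for real-time autoregressive audio.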
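
To make the saving concrete, here is a toy single-head attention layer in plain Python, run with and without a cache. The 2-dimensional projection matrices and token vectors are made-up values, and the "work" counter simply counts how many positions have their key and value projected at each step; it is a sketch of the idea rather than a model implementation.

```python
import math

# Toy 2-dimensional projection matrices (hypothetical values).
W_Q = [[0.5, 0.1], [-0.2, 0.3]]
W_K = [[0.4, -0.1], [0.2, 0.6]]
W_V = [[1.0, 0.0], [0.5, -0.5]]


def project(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def step_without_cache(tokens):
    """Re-project K and V for every position seen so far."""
    keys = [project(W_K, t) for t in tokens]
    values = [project(W_V, t) for t in tokens]
    return attend(project(W_Q, tokens[-1]), keys, values), len(tokens)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Project only the new token and append it to the cache."""
        self.keys.append(project(W_K, token))
        self.values.append(project(W_V, token))
        return attend(project(W_Q, token), self.keys, self.values), 1


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
cache = KVCache()
uncached_work = cached_work = 0
max_diff = 0.0
for t in range(1, len(tokens) + 1):
    out_a, work_a = step_without_cache(tokens[:t])
    out_b, work_b = cache.step(tokens[t - 1])
    uncached_work += work_a
    cached_work += work_b
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(out_a, out_b)))

print(uncached_work, cached_work, max_diff)
```

Both paths produce the same outputs, but over four tokens the uncached path projects 1 + 2 + 3 + 4 = 10 positions while the cached path projects 4.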

Diffusion Model Speedups

Diffusion models are inherently slow due to iterative denoising. Several strategies reduce the number of steps:

Fewer Diffusion Steps

Standard: 50–1000 steps. Optimized: 4–20 steps.

DDIM (Denoising Diffusion Implicit Models):

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon$$

With $\sigma_t = 0$ the update is deterministic, and DDIM tolerates far fewer sampling steps than the training schedule (e.g., 50 → 10).
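
A single DDIM update can be sketched as follows. The page does not spell out $\hat{x}_0$; the code assumes the standard DDIM estimate $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta) / \sqrt{\bar{\alpha}_t}$. The function name, the argument names, and the toy vectors are illustrative, and the noise prediction is passed in directly instead of coming from a trained network.

```python
import math
import random


def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t=0.0, rng=None):
    """One DDIM update from noise level abar_t to abar_prev.

    x_t and eps_pred are equal-length lists; eps_pred stands in for the
    network output eps_theta(x_t, t).
    """
    rng = rng or random.Random(0)
    direction = math.sqrt(max(1.0 - abar_prev - sigma_t ** 2, 0.0))
    out = []
    for x, e in zip(x_t, eps_pred):
        # Predicted clean sample (standard DDIM estimate, assumed here).
        x0_hat = (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        noise = rng.gauss(0.0, 1.0) if sigma_t > 0 else 0.0
        out.append(math.sqrt(abar_prev) * x0_hat + direction * e + sigma_t * noise)
    return out


# Sanity check with a "perfect" noise prediction on toy data.
x0 = [0.5, -0.25]
eps = [1.0, -2.0]
abar_t, abar_prev = 0.5, 0.9
x_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * b for a, b in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = [math.sqrt(abar_prev) * a + math.sqrt(1.0 - abar_prev) * b
            for a, b in zip(x0, eps)]
print(x_prev, expected)
```

When the noise prediction is exact and $\sigma_t = 0$, the step lands on the same sample re-noised to the lower level, which is why large jumps between noise levels remain usable.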

Consistency Models

Train a model to directly predict the final clean sample from any noise level:

$$f_\theta(x_t, t) \approx x_0 \quad \forall t$$

Enables 1-step or 2-step generation.

Progressive Distillation

Distill a multi-step diffusion model into fewer steps:

  1. Start with an N-step teacher model
  2. Train a student to match the teacher's output in N/2 steps
  3. Repeat with the student as the new teacher: N/4, N/8, etc.
  4. End with a 1–4 step model
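
The halving procedure implies a fixed number of distillation rounds. The sketch below only computes the step-count schedule (the training itself is out of scope), and `halving_schedule` is a hypothetical helper name.

```python
def halving_schedule(teacher_steps, final_steps=1):
    """List the sampler step counts visited when halving repeatedly."""
    if teacher_steps < 1 or final_steps < 1:
        raise ValueError("step counts must be positive")
    schedule = [teacher_steps]
    # Each round trains a student that uses half the steps of its teacher.
    while schedule[-1] // 2 >= final_steps:
        schedule.append(schedule[-1] // 2)
    return schedule


schedule = halving_schedule(64, final_steps=4)
print(schedule, "rounds:", len(schedule) - 1)
```

Going from 64 to 4 steps takes four rounds (64 → 32 → 16 → 8 → 4), each of which is a full training run.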

Latent Diffusion (Compression-First)

Diffuse in compressed latent space rather than raw audio:

$$\text{Speedup} \approx \frac{T_{\text{audio}}}{T_{\text{latent}}} \times \frac{C_{\text{audio}}}{C_{\text{latent}}}$$

Typical latent compression: 64–256× fewer elements than raw audio.
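
A worked instance of the ratio, using made-up shapes: a mono clip of 524,288 samples compressed to 512 latent frames with 8 channels each. The function name and the numbers are illustrative assumptions, not measurements of any particular codec.

```python
def latent_speedup(t_audio, c_audio, t_latent, c_latent):
    """Element-count ratio from the formula: (T_audio / T_latent) * (C_audio / C_latent)."""
    return (t_audio / t_latent) * (c_audio / c_latent)


ratio = latent_speedup(t_audio=524_288, c_audio=1, t_latent=512, c_latent=8)
print(f"{ratio:.0f}x fewer elements")
```

The time axis shrinks 1024× while the channel count grows 8×, giving a net 128× reduction, inside the 64–256× range quoted above.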

Vocoder Optimization

The vocoder (mel → waveform) is often the latency bottleneck in mel-spectrogram-based systems.

Streaming Vocoders

Process audio in chunks rather than waiting for the full spectrogram:

Mel chunk 1 ──▶ Vocoder ──▶ Audio chunk 1 (output immediately)
Mel chunk 2 ──▶ Vocoder ──▶ Audio chunk 2 (output immediately)
...

Requires causal architecture (no future lookahead).
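
The causality requirement can be illustrated with a stand-in for the vocoder: a causal FIR filter that carries its past context between chunks. This is only a sketch; a real streaming vocoder carries the receptive-field state of each causal convolution in the same way. All names and values below are assumptions made for the example.

```python
def causal_fir(chunk, taps, history=None):
    """y[n] = sum_j taps[j] * x[n - j], using only current and past samples.

    Returns the output chunk and the history to pass to the next call.
    """
    context = len(taps) - 1
    past = list(history) if history is not None else [0.0] * context
    padded = past + list(chunk)
    out = [sum(tap * padded[context + n - j] for j, tap in enumerate(taps))
           for n in range(len(chunk))]
    return out, padded[len(padded) - context:]


def stream(chunks, taps):
    """Process chunks one by one, yielding each output as soon as it is ready."""
    history = None
    for chunk in chunks:
        out, history = causal_fir(chunk, taps, history)
        yield out  # can be played immediately


signal = [float(n) for n in range(12)]
taps = [0.5, 0.3, 0.2]  # toy filter
offline, _ = causal_fir(signal, taps)
chunks = [signal[i:i + 4] for i in range(0, len(signal), 4)]
streamed = [y for out in stream(chunks, taps) for y in out]
print(streamed == offline)
```

Because no output sample depends on future input, chunked processing with carried state reproduces the offline result, so the first chunk can be emitted without waiting for the rest.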

Faster Vocoder Architectures

| Vocoder | RTF (GPU) | Quality |
| --- | --- | --- |
| WaveNet | 0.01× | Excellent |
| HiFi-GAN V1 | ~80× | Very good |
| HiFi-GAN V3 | ~150× | Good |
| Vocos | ~200× | Good |
| MB-MelGAN | ~150× | Good |

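RTF = Real-Time Factor, defined here as seconds of audio produced per second of compute (>1× means faster than real-time). Some papers report the inverse ratio, so check the convention before comparing numbers.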
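
Under this page's convention the computation is a single division. The helper name and the timings below are illustrative assumptions, not measurements.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio produced per second of compute (>1 is faster than real time)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


fast = real_time_factor(audio_seconds=10.0, processing_seconds=0.125)
slow = real_time_factor(audio_seconds=10.0, processing_seconds=1000.0)
print(f"{fast:.0f}x, {slow:.2f}x")
```

An RTF of 80× means 10 seconds of audio take 0.125 seconds to synthesize; an RTF of 0.01× means the same clip takes 1000 seconds.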

ONNX / TensorRT

Export vocoders to optimized inference formats:

  • ONNX: cross-platform, moderate optimization
  • TensorRT: NVIDIA GPU, aggressive optimization (2–5× over PyTorch)
  • CoreML: Apple Silicon optimization
  • OpenVINO: Intel optimization

Hardware Considerations

GPU Inference

| GPU | VRAM | Suitable For |
| --- | --- | --- |
| RTX 3060 | 12 GB | Small models, vocoders |
| RTX 4090 | 24 GB | Medium models, real-time |
| A100 | 40/80 GB | Large models, batch |
| H100 | 80 GB | Largest models, lowest latency |

CPU Inference

For edge deployment:

  • ONNX Runtime with optimized kernels
  • Quantized INT8 models
  • Limited to small models or vocoders
  • Latency: typically 10–100× higher than on GPU

Edge / Mobile

| Platform | Framework | Feasibility |
| --- | --- | --- |
| iOS | CoreML | Vocoders, small models |
| Android | TFLite, ONNX | Vocoders, small models |
| Raspberry Pi | ONNX | Vocoders only |
| Web (WASM) | ONNX.js | Very limited |

Streaming Architecture

For real-time applications, implement a streaming pipeline:

Input (continuous) ──▶ Buffer ──▶ Model (chunk processing) ──▶ Output Buffer ──▶ Audio Out
                         │                                          │
                    Input chunk                                Output chunk
                     (overlap)                                  (crossfade)

Overlap-Add for Seamless Output

Process overlapping chunks and crossfade:

$$y[n] = \sum_{k} y_k[n - kH] \cdot w[n - kH]$$

where $H$ is the hop between chunks and $w$ is a crossfade window.
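
The sum can be implemented directly. The sketch below assumes a periodic Hann window with 50% overlap (hop equal to half the chunk length), a common choice because the overlapping windows then sum to one; the page itself does not prescribe a window, and the function names are hypothetical.

```python
import math


def periodic_hann(length):
    """Hann window whose 50%-overlapped copies sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def overlap_add(chunks, hop, window):
    """y[n] = sum_k y_k[n - kH] * w[n - kH] for equal-length chunks."""
    length = len(window)
    out = [0.0] * (hop * (len(chunks) - 1) + length)
    for k, chunk in enumerate(chunks):
        for n in range(length):
            out[k * hop + n] += chunk[n] * window[n]
    return out


chunk_len, hop = 8, 4
window = periodic_hann(chunk_len)
chunks = [[1.0] * chunk_len for _ in range(4)]  # constant signal as a test case
y = overlap_add(chunks, hop, window)
interior = y[hop:len(y) - hop]
print([round(v, 6) for v in interior])
```

For a constant input the interior of the output stays at 1.0, which shows that the crossfade introduces no amplitude ripple at chunk boundaries. Only the first and last half-chunk fade in and out.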

Buffer Management

$$\text{Latency} = T_{\text{input\_buffer}} + T_{\text{processing}} + T_{\text{output\_buffer}}$$

Minimize buffer sizes while ensuring the model can process within one buffer period.
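
Buffer durations follow from the buffer size and the sample rate, so the budget can be checked with a few lines of arithmetic. The helper names, the 48 kHz rate, and the 4 ms processing time below are assumed example values.

```python
def buffer_ms(buffer_samples, sample_rate_hz):
    """Duration of one buffer in milliseconds."""
    return 1000.0 * buffer_samples / sample_rate_hz


def streaming_latency_ms(in_samples, out_samples, processing_ms, sample_rate_hz):
    """T_input_buffer + T_processing + T_output_buffer."""
    return (buffer_ms(in_samples, sample_rate_hz)
            + processing_ms
            + buffer_ms(out_samples, sample_rate_hz))


def keeps_up(buffer_samples, sample_rate_hz, processing_ms):
    """True if the model finishes a chunk within one buffer period."""
    return processing_ms <= buffer_ms(buffer_samples, sample_rate_hz)


total = streaming_latency_ms(in_samples=480, out_samples=480,
                             processing_ms=4.0, sample_rate_hz=48_000)
print(total, keeps_up(480, 48_000, 4.0), keeps_up(96, 48_000, 4.0))
```

With 480-sample buffers at 48 kHz each buffer adds 10 ms, for 24 ms in total. Shrinking the buffers to 96 samples (2 ms) would lower the latency on paper, but a 4 ms model could no longer keep up and the output would underrun.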

Benchmarking

Metrics to Track

| Metric | Definition |
| --- | --- |
| Latency (p50, p99) | Time from input to output |
| Throughput | Samples generated per second |
| RTF | Real-time factor |
| Memory | Peak GPU/CPU memory usage |
| Quality (FAD) | Audio quality at different speed settings |
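
Latency percentiles can be collected with the standard library alone. In this sketch `fake_model` is a placeholder workload standing in for a real inference call, and the run counts are arbitrary. For GPU models the timed function would also need to wait for the device to finish before the clock is read.

```python
import statistics
import time


def benchmark(fn, runs=200, warmup=10):
    """Time repeated calls to `fn` and report latency percentiles in ms."""
    for _ in range(warmup):  # discard cold-start effects
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    percentiles = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "p50_ms": percentiles[49],
        "p99_ms": percentiles[98],
        "mean_ms": statistics.fmean(latencies_ms),
    }


def fake_model():
    """Placeholder workload; replace with the real inference call."""
    return sum(i * i for i in range(10_000))


stats = benchmark(fake_model, runs=100, warmup=5)
print(stats)
```

Reporting p99 alongside p50 matters for real-time audio because a single slow chunk causes an audible dropout even when the median is comfortably within budget.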

Latency vs. Quality Trade-off

Most optimization techniques sacrifice some quality for speed. Always benchmark both:

Speed optimization ──▶ Measure latency reduction
                   ──▶ Measure quality change (FAD, MOS)
                   ──▶ Accept only if quality remains above threshold
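
The acceptance rule can be written as a small gate. Everything in this sketch is hypothetical: the function name, the thresholds, and the measurement values are invented for illustration, with FAD treated as lower-is-better and MOS as higher-is-better.

```python
def accept_optimization(baseline, candidate, max_fad=2.0, min_mos=4.0):
    """Accept a candidate only if it is faster and quality stays within limits."""
    speedup = baseline["latency_ms"] / candidate["latency_ms"]
    quality_ok = candidate["fad"] <= max_fad and candidate["mos"] >= min_mos
    return speedup > 1.0 and quality_ok, {"speedup": speedup, "quality_ok": quality_ok}


# Made-up measurements, for illustration only.
baseline = {"latency_ms": 120.0, "fad": 1.4, "mos": 4.3}
int8 = {"latency_ms": 40.0, "fad": 1.6, "mos": 4.2}
int4 = {"latency_ms": 24.0, "fad": 3.1, "mos": 3.6}

print(accept_optimization(baseline, int8))
print(accept_optimization(baseline, int4))
```

In this made-up comparison the INT8 variant passes (3× faster, quality within limits), while the INT4 variant is rejected despite being faster because both quality checks fail.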