
Real-Time Audio Inference

Deploying AI music models in real-time or near-real-time applications requires careful optimization of latency, throughput, and computational efficiency. This page covers the engineering strategies for fast audio inference.

Latency Requirements​

Different applications have different latency budgets:

| Application | Max Latency | Challenge Level |
| --- | --- | --- |
| Live performance | <10 ms | Extreme |
| Interactive music tools | <100 ms | Very hard |
| Real-time voice conversion | <50 ms | Hard |
| Streaming generation | <1 second | Moderate |
| Batch generation | Minutes | Easy |
| Offline processing | Hours | Trivial |
The total latency budget decomposes as:

$$\text{Total latency} = T_{\text{input}} + T_{\text{compute}} + T_{\text{output}} + T_{\text{network}}$$

Model Optimization Techniques​

Quantization​

Reduce numerical precision of model weights and activations:

Post-Training Quantization (PTQ):

$$w_q = \text{round}\left(\frac{w}{\Delta}\right) \cdot \Delta, \quad \Delta = \frac{w_{\max} - w_{\min}}{2^B - 1}$$
| Precision | Memory | Speed | Quality |
| --- | --- | --- | --- |
| FP32 | 4 bytes/param | Baseline | Reference |
| FP16 | 2 bytes/param | ~2× faster | Near-identical |
| INT8 | 1 byte/param | ~3–4× faster | Slight degradation |
| INT4 | 0.5 bytes/param | ~5–6× faster | Noticeable degradation |

Quantization-Aware Training (QAT): simulate quantization during training, resulting in better quality at low precision.

For audio models, INT8 quantization typically preserves quality. INT4 may introduce audible artifacts in vocoders.
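The round-to-grid formula above can be sketched directly in pure Python; `quantize_dequantize` is a hypothetical helper illustrating uniform PTQ, not a production quantizer (real toolchains also handle per-channel scales and zero-points):

```python
def quantize_dequantize(weights, num_bits=8):
    """Uniform post-training quantization sketch: snap each weight to
    the nearest of 2^B grid points spaced delta apart, then dequantize."""
    w_min, w_max = min(weights), max(weights)
    delta = (w_max - w_min) / (2 ** num_bits - 1)  # step size (the formula's Δ)
    # w_q = round(w / Δ) · Δ, per the formula above.
    return [round(w / delta) * delta for w in weights]

weights = [0.31, -0.52, 0.07, 0.98, -1.0]
low_precision = quantize_dequantize(weights, num_bits=4)
```

The worst-case error per weight is half a grid step, which is why quality degrades as the bit width `B` (and hence the grid density) shrinks.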

Knowledge Distillation​

Train a smaller "student" model to mimic a larger "teacher":

$$\mathcal{L}_{\text{distill}} = \alpha \mathcal{L}_{\text{task}} + (1-\alpha) \mathcal{L}_{\text{KD}}, \qquad \mathcal{L}_{\text{KD}} = D_{\text{KL}}(p_{\text{student}} \,\|\, p_{\text{teacher}})$$

Distillation can reduce model size by 2–10× while retaining most quality.
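A minimal sketch of the combined objective, using the KL direction given above on temperature-softened outputs; `kd_loss` and its arguments are illustrative names, not a real library API:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher T softens the distribution."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, task_loss, alpha=0.5, temperature=2.0):
    """Distillation objective sketch:
    alpha * task loss + (1 - alpha) * KL(student || teacher)."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    kl = sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))
    return alpha * task_loss + (1 - alpha) * kl
```

When student and teacher agree exactly, the KL term vanishes and only the task loss remains.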

Pruning​

Remove unnecessary weights:

  • Structured pruning: remove entire channels, heads, or layers
  • Unstructured pruning: zero out individual weights by magnitude

$$w'_i = \begin{cases} w_i & \text{if } |w_i| > \theta \\ 0 & \text{otherwise} \end{cases}$$

Typical audio models can be pruned 30–50% with minimal quality loss.
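Unstructured magnitude pruning amounts to picking the threshold θ from a target sparsity; `magnitude_prune` is a hypothetical helper operating on a flat weight list (real frameworks prune tensors in place and may enforce structured patterns):

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute value (ties at the threshold are also pruned)."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    # Threshold theta = k-th smallest magnitude; keep only |w| > theta.
    theta = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [w if abs(w) > theta else 0.0 for w in weights]
```

After pruning, the zeros only help latency if the runtime exploits sparsity (or if pruning is structured so whole channels can be removed).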

Architecture-Level Optimizations​

  • Reduce attention layers: attention is the bottleneck for long sequences
  • Use efficient attention: Flash Attention, linear attention
  • Reduce model depth: fewer layers with wider channels
  • Cache key-value pairs: for autoregressive models, cache KV states

KV-Cache for Autoregressive Models​

In autoregressive generation, caching previous key-value pairs avoids recomputation:

  • Without cache: each token requires attending to all previous tokens from scratch
  • With cache: only compute attention for the new token against cached KV pairs

Without a cache, step $t$ reruns the forward pass over the entire $t$-token prefix, so generating $T$ tokens costs $O(T^2)$ token evaluations; with a cache each step processes only the new token, for $O(T)$ total:

$$\text{Speedup} \approx \frac{T^2}{T} = T$$

for generating $T$ tokens. KV-caching is essential for real-time autoregressive audio.
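A toy single-head illustration of the caching pattern, assuming list-based vectors; `KVCache.step` appends the new key/value pair and attends only the new query against everything cached so far:

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention of one query against cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Append-only KV cache: each decode step adds one key/value pair
    and reuses all previously cached pairs instead of recomputing them."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

With a single cached pair, the softmax weight is 1 and the output is exactly that pair's value vector.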

Diffusion Model Speedups​

Diffusion models are inherently slow due to iterative denoising. Several strategies reduce the number of steps:

Fewer Diffusion Steps​

Standard: 50–1000 steps. Optimized: 4–20 steps.

DDIM (Denoising Diffusion Implicit Models):

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon$$

DDIM enables deterministic sampling with fewer steps (e.g., 50 → 10).
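One DDIM update can be sketched per element from the formula above, assuming the model's noise estimate $\epsilon_\theta(x_t, t)$ is passed in; with `sigma_t = 0` the step is fully deterministic:

```python
import math, random

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma_t=0.0):
    """One DDIM update (element-wise sketch). `eps_pred` stands in for the
    network's noise prediction; sigma_t = 0 gives the deterministic sampler."""
    # Predicted clean sample x̂0 recovered from the noise estimate.
    x0_hat = [(x - math.sqrt(1 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
              for x, e in zip(x_t, eps_pred)]
    direction = math.sqrt(1 - alpha_bar_prev - sigma_t ** 2)
    noise = [sigma_t * random.gauss(0.0, 1.0) for _ in x_t]
    return [math.sqrt(alpha_bar_prev) * x0 + direction * e + n
            for x0, e, n in zip(x0_hat, eps_pred, noise)]
```

Because the deterministic update depends only on $(x_t, \epsilon_\theta)$, the sampler can jump between widely spaced timesteps, which is what allows the 50 → 10 step reduction.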

Consistency Models​

Train a model to directly predict the final clean sample from any noise level:

$$f_\theta(x_t, t) \approx x_0 \quad \forall t$$

Enables 1-step or 2-step generation.

Progressive Distillation​

Distill a multi-step diffusion model into fewer steps:

  1. Start with N-step model
  2. Train student to match N/2 steps
  3. Repeat: N/4, N/8, etc.
  4. End with 1–4 step model
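The halving schedule in the steps above can be sketched as follows; `distillation_schedule` is an illustrative helper (each round's actual student training is omitted):

```python
def distillation_schedule(initial_steps, min_steps=4):
    """Progressive distillation schedule sketch: each round trains a
    student to match the current teacher in half as many steps."""
    schedule, steps = [], initial_steps
    while steps > min_steps:
        steps //= 2
        schedule.append(steps)
    return schedule

print(distillation_schedule(64))  # rounds target 32, 16, 8, 4 steps
```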

Latent Diffusion (Compression-First)​

Diffuse in compressed latent space rather than raw audio:

$$\text{Speedup} \approx \frac{T_{\text{audio}}}{T_{\text{latent}}} \times \frac{C_{\text{audio}}}{C_{\text{latent}}}$$

where $T$ is the temporal length (frames or samples per second) and $C$ is the channel count of each representation.

Typical latent compression: 64–256× fewer elements than raw audio.
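A quick worked example of the speedup estimate, with hypothetical numbers loosely modeled on a neural-codec latent (24 kHz mono audio vs. a 50 frame/s, 4-channel latent):

```python
def latent_speedup(t_audio, t_latent, c_audio, c_latent):
    """Rough element-count speedup for diffusing in a compressed latent
    space instead of raw audio (temporal ratio times channel ratio)."""
    return (t_audio / t_latent) * (c_audio / c_latent)

# 24000 samples/s, 1 channel vs. 50 frames/s, 4 channels -> 120x fewer elements
estimate = latent_speedup(24000, 50, 1, 4)
```

This lands inside the 64–256× range quoted above; the actual wall-clock gain also depends on the encoder/decoder cost added by the codec.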

Vocoder Optimization​

The vocoder (mel → waveform) is often the latency bottleneck in mel-spectrogram-based systems.

Streaming Vocoders​

Process audio in chunks rather than waiting for the full spectrogram:

Mel chunk 1 ──▢ Vocoder ──▢ Audio chunk 1 (output immediately)
Mel chunk 2 ──▢ Vocoder ──▢ Audio chunk 2 (output immediately)
...

Requires causal architecture (no future lookahead).
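The chunked loop can be sketched with a stand-in vocoder; `toy_vocoder`, `HOP`, and the left-context handling are illustrative, not a real model interface:

```python
HOP = 4  # audio samples produced per mel frame (toy value)

def toy_vocoder(frames):
    """Stand-in for a causal neural vocoder: HOP samples per mel frame."""
    return [float(f) for f in frames for _ in range(HOP)]

def stream_vocode(mel_chunks, vocoder, context=2):
    """Streaming driver sketch: feed mel chunks as they arrive, keep a
    little causal left context, and emit each audio chunk immediately."""
    history = []
    for chunk in mel_chunks:
        left = history[-context:]          # causal context: past frames only
        audio = vocoder(left + chunk)
        yield audio[len(left) * HOP:]      # drop samples re-generated for context
        history.extend(chunk)
```

The generator yields audio as soon as each mel chunk is available, so end-to-end latency is bounded by one chunk plus the vocoder's compute time.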

Faster Vocoder Architectures​

| Vocoder | RTF (GPU) | Quality |
| --- | --- | --- |
| WaveNet | 0.01× | Excellent |
| HiFi-GAN V1 | ~80× | Very good |
| HiFi-GAN V3 | ~150× | Good |
| Vocos | ~200× | Good |
| MB-MelGAN | ~150× | Good |

RTF = Real-Time Factor (>1× means faster than real-time).

ONNX / TensorRT​

Export vocoders to optimized inference formats:

  • ONNX: cross-platform, moderate optimization
  • TensorRT: NVIDIA GPU, aggressive optimization (2–5× over PyTorch)
  • CoreML: Apple Silicon optimization
  • OpenVINO: Intel optimization

Hardware Considerations​

GPU Inference​

| GPU | VRAM | Suitable For |
| --- | --- | --- |
| RTX 3060 | 12 GB | Small models, vocoders |
| RTX 4090 | 24 GB | Medium models, real-time |
| A100 | 40/80 GB | Large models, batch |
| H100 | 80 GB | Largest models, lowest latency |

CPU Inference​

For edge deployment:

  • ONNX Runtime with optimized kernels
  • Quantized INT8 models
  • Limited to small models or vocoders
  • Latency: 10–100× slower than GPU

Edge / Mobile​

| Platform | Framework | Feasibility |
| --- | --- | --- |
| iOS | CoreML | Vocoders, small models |
| Android | TFLite, ONNX | Vocoders, small models |
| Raspberry Pi | ONNX | Vocoders only |
| Web (WASM) | ONNX.js | Very limited |

Streaming Architecture​

For real-time applications, implement a streaming pipeline:

Input (continuous) ──▢ Buffer ──▢ Model (chunk processing) ──▢ Output Buffer ──▢ Audio Out
                         │                                          │
                    Input chunk                               Output chunk
                     (overlap)                                 (crossfade)

Overlap-Add for Seamless Output​

Process overlapping chunks and crossfade:

$$y[n] = \sum_{k} y_k[n - kH] \cdot w[n - kH]$$

where $H$ is the hop between chunks and $w$ is a crossfade window.
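The overlap-add sum can be implemented directly; this sketch assumes equal-length chunks and a Hann window, whose overlapped halves sum to one at 50% overlap:

```python
import math

def overlap_add(chunks, hop):
    """Overlap-add sketch: window each chunk with a (periodic) Hann
    window and sum the copies shifted by `hop` samples."""
    n = len(chunks[0])
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    out = [0.0] * (hop * (len(chunks) - 1) + n)
    for k, chunk in enumerate(chunks):
        for i, sample in enumerate(chunk):
            out[k * hop + i] += sample * window[i]
    return out
```

With `hop = n // 2`, overlapping window halves satisfy $w[i] + w[i + n/2] = 1$, so a constant signal is reconstructed exactly in the steady-state region; other hop sizes need a compensating normalization.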

Buffer Management​

$$\text{Latency} = T_{\text{input\_buffer}} + T_{\text{processing}} + T_{\text{output\_buffer}}$$

Minimize buffer sizes while ensuring the model can process within one buffer period.

Benchmarking​

Metrics to Track​

| Metric | Definition |
| --- | --- |
| Latency (p50, p99) | Time from input to output |
| Throughput | Samples generated per second |
| RTF | Real-time factor |
| Memory | Peak GPU/CPU memory usage |
| Quality (FAD) | Audio quality at different speed settings |
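A minimal harness for the latency percentiles above; `benchmark` is an illustrative helper using wall-clock timing (real GPU benchmarking also needs device synchronization before each timestamp):

```python
import time

def benchmark(fn, warmup=3, runs=50):
    """Run `fn` repeatedly and report p50/p99 wall-clock latency in ms."""
    for _ in range(warmup):
        fn()                                  # warm caches / JIT / GPU kernels
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    p50 = times[len(times) // 2]
    p99 = times[min(len(times) - 1, int(len(times) * 0.99))]
    return {"p50_ms": p50, "p99_ms": p99}
```

Reporting p99 alongside p50 matters for real-time audio: a single slow step that overruns the buffer period is an audible glitch even if the median latency is fine.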

Latency vs. Quality Trade-off​

Most optimization techniques sacrifice some quality for speed. Always benchmark both:

Speed optimization ──▢ Measure latency reduction
                   ──▢ Measure quality change (FAD, MOS)
                   ──▢ Accept only if quality remains above threshold