Transformers for Audio
Transformers model long-range dependencies in music by applying attention over tokenized audio representations.
Scaled Dot-Product Attention
Scaled dot-product attention lets each audio token aggregate context from every other position in the sequence, improving coherence across bars and sections.
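The standard formulation is Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of this computation (self-attention over a toy sequence of audio token embeddings; the shapes and values are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T_q, T_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                            # context-weighted values

# Toy example: 4 audio tokens with 8-dim embeddings, attending to themselves.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value vectors, weighted by how strongly that token attends to each position.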
Multi-Head Attention
Each head can specialize in a different aspect of musical structure, such as rhythm, harmony, or instrumentation cues.
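Mechanically, multi-head attention splits the model dimension into per-head subspaces, attends within each subspace independently, then concatenates and projects the results. A self-contained sketch (the weight matrices here are random placeholders, not trained parameters):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads subspaces, attend per head, concat, project."""
    T, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    def split(M):  # (T, d_model) -> (n_heads, T, d_head)
        return M.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # per-head softmax
    heads = w @ Vh                                # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o                           # final output projection

rng = np.random.default_rng(1)
d = 16
x = rng.normal(size=(6, d))                       # 6 tokens, d_model = 16
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads=4)
print(out.shape)  # (6, 16)
```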
Tokenization Strategies
Modern systems commonly use:
- Codec tokens (for example EnCodec/SoundStream-like codes)
- Spectrogram patches
- Quantized latent codes from VQ models
Tokenization quality strongly affects downstream generation fidelity and efficiency.
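The core step shared by the VQ-based strategies above is nearest-neighbor quantization: map each continuous latent frame to the index of its closest codebook vector. A simplified sketch (real codecs like EnCodec use residual vector quantization with multiple stacked codebooks; the single-codebook version below is illustrative, and the codebook here is random rather than learned):

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each continuous frame the id of its nearest codebook vector."""
    # Squared distances between every frame and every code: shape (T, K).
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)       # discrete token ids, shape (T,)
    recon = codebook[tokens]         # dequantized approximation of the frames
    return tokens, recon

rng = np.random.default_rng(2)
codebook = rng.normal(size=(256, 32))   # 256-entry codebook, 32-dim codes
frames = rng.normal(size=(10, 32))      # 10 latent frames from an encoder
tokens, recon = quantize(frames, codebook)
print(tokens.shape, recon.shape)  # (10,) (10, 32)
```

The resulting integer token sequence is what the transformer actually models; reconstruction quality depends on codebook size and training, which is why tokenizer quality caps generation fidelity.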
Autoregressive Audio Modeling
Autoregressive transformers generate audio tokens one at a time, each step conditioned on all previous tokens. This makes them effective for structure and continuation tasks, but sampling is slow for long sequences because every new token requires another forward pass.
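The sampling loop itself is simple; the cost comes from calling the model once per generated token. A sketch with a hypothetical `logits_fn` standing in for the transformer's forward pass (the toy model below is not a real audio model, just a stand-in that makes the loop runnable):

```python
import numpy as np

def sample_autoregressive(logits_fn, prompt, steps, temperature=1.0, seed=0):
    """Generate tokens one at a time; each step conditions on the full prefix."""
    rng = np.random.default_rng(seed)
    seq = list(prompt)
    for _ in range(steps):
        logits = logits_fn(seq) / temperature   # one full forward pass per step
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # softmax over the vocabulary
        seq.append(int(rng.choice(len(p), p=p)))
    return seq

# Toy stand-in "model": favors repeating or incrementing the last token (vocab 16).
def toy_logits(seq, vocab=16):
    logits = np.zeros(vocab)
    logits[seq[-1] % vocab] = 2.0
    logits[(seq[-1] + 1) % vocab] = 1.0
    return logits

out = sample_autoregressive(toy_logits, prompt=[3], steps=8)
print(len(out))  # 9 tokens: the prompt plus 8 generated
```

At audio codec rates (often 50+ tokens per second per codebook), this per-token loop is why long continuations are expensive without techniques like key-value caching.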
Text-to-Audio Conditioning
Cross-attention injects text embeddings into the audio token stream during decoding, mapping prompt semantics (genre, mood, instrumentation, arrangement cues) to generation decisions.
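The distinguishing feature of cross-attention is that queries come from the audio stream while keys and values come from the text embeddings, so each audio position gathers relevant prompt context. A minimal sketch (learned Q/K/V projection matrices are omitted for brevity; the embeddings are random placeholders):

```python
import numpy as np

def cross_attention(audio_tokens, text_embed):
    """Audio positions (queries) attend over text embeddings (keys/values)."""
    d_k = audio_tokens.shape[-1]
    scores = audio_tokens @ text_embed.T / np.sqrt(d_k)  # (T_audio, T_text)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)   # each audio position's weights over text
    return w @ text_embed                # text context gathered per audio position

rng = np.random.default_rng(3)
audio = rng.normal(size=(12, 64))   # 12 audio positions in the decoder
text = rng.normal(size=(5, 64))     # 5 prompt-token embeddings from a text encoder
out = cross_attention(audio, text)
print(out.shape)  # (12, 64)
```

In a full model this output is added back into the audio stream inside each decoder layer, letting the prompt steer every generation step rather than only the initial state.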