Attention Mathematics
Attention mechanisms are the computational foundation of transformers, and transformers are the dominant architecture in modern music AI. This page covers the mathematics of attention in depth.
Scaled Dot-Product Attention
The basic attention operation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where:
- $Q \in \mathbb{R}^{n \times d_k}$ — queries
- $K \in \mathbb{R}^{m \times d_k}$ — keys
- $V \in \mathbb{R}^{m \times d_v}$ — values
- $d_k$ — key dimension (scaling factor)
Step by Step
1. Compute attention scores:

$$S = QK^\top$$

Each entry $S_{ij}$ measures the relevance of key $j$ to query $i$.
2. Scale:

$$S' = \frac{S}{\sqrt{d_k}}$$

Scaling prevents softmax saturation. Without it, large $d_k$ produces extreme dot products, causing softmax to output near-one-hot distributions with vanishing gradients.
3. Softmax normalization:

$$A = \mathrm{softmax}(S') \quad \text{(row-wise)}$$

Each row of $A$ sums to 1, forming a probability distribution over keys.
4. Weighted combination:

$$O = AV$$

Each output is a weighted mixture of values, with weights determined by query-key similarity.
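The four steps above can be sketched in NumPy (a minimal, unmasked, single-head version; the max-subtraction before `exp` is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of the four steps: score, scale, softmax, mix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                  # step 1: S = Q K^T
    scores = scores / np.sqrt(d_k)                    # step 2: scale by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # step 3: row-wise softmax
    return weights @ V, weights                       # step 4: O = A V

# toy example: 4 queries attending over 6 keys/values
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
out, A = scaled_dot_product_attention(Q, K, V)
# each row of A is a probability distribution over the 6 keys
```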
Multi-Head Attention
Instead of a single attention function, use $h$ parallel heads with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.
Typically $d_k = d_v = d_{model} / h$.
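The head-splitting mechanics can be sketched in NumPy, assuming $d_k = d_v = d_{model}/h$ and self-attention (the weight matrices here are random placeholders, not trained parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Project, split into h heads, attend per head, concat, project out."""
    n, d_model = X.shape
    d_head = d_model // h
    def split(W):
        # (n, d_model) -> (h, n, d_head)
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, n, n)
    heads = A @ V                                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # Concat(head_1..head_h)
    return concat @ Wo

rng = np.random.default_rng(1)
n, d_model, h = 10, 64, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```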
Why Multiple Heads?
In music contexts, different heads learn to attend to different aspects:
- Head A: rhythmic patterns (attend to same beat positions)
- Head B: harmonic relationships (attend to consonant pitch patterns)
- Head C: local context (attend to nearby tokens)
- Head D: structural repeats (attend to similar sections)
Self-Attention vs. Cross-Attention
Self-Attention
$Q$, $K$, $V$ all come from the same sequence $X$:

$$\mathrm{SelfAttn}(X) = \mathrm{Attention}(XW^Q,\, XW^K,\, XW^V)$$
Used for modeling dependencies within the audio token sequence.
Cross-Attention
$Q$ comes from one sequence (audio) $X$, while $K$ and $V$ come from another (text) $C$:

$$\mathrm{CrossAttn}(X, C) = \mathrm{Attention}(XW^Q,\, CW^K,\, CW^V)$$
Cross-attention is how text conditioning is injected into audio generation. Each audio token "looks at" the text embeddings to decide what to generate.
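The two variants differ only in where $K$ and $V$ come from, as this NumPy sketch shows (random embeddings stand in for real audio and text features):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(2)
d = 32
audio = rng.normal(size=(20, d))   # placeholder audio token embeddings
text = rng.normal(size=(5, d))     # placeholder text embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# self-attention: Q, K, V all derived from the audio sequence
self_out = attention(audio @ Wq, audio @ Wk, audio @ Wv)
# cross-attention: queries from audio, keys/values from text
cross_out = attention(audio @ Wq, text @ Wk, text @ Wv)
```

In both cases the output has one row per query (audio) position; cross-attention just mixes text-derived values instead of audio-derived ones.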
Computational Complexity
Standard self-attention has complexity:

$$O(n^2 \cdot d)$$

where $n$ is sequence length and $d$ is model dimension. The $n^2$ term comes from materializing the full $n \times n$ score matrix.
For music at 50 tokens/sec, a 30-second clip has $n = 1{,}500$ tokens. The attention matrix has $1{,}500^2 = 2.25$ million entries — manageable.
For raw audio at 24 kHz, a 30-second clip has $n = 720{,}000$ samples. The attention matrix would have over $5 \times 10^{11}$ entries — standard attention is infeasible.
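These counts are easy to verify with back-of-envelope arithmetic (assuming a float32 matrix for a single head, ignoring batch size):

```python
# token counts for a 30-second clip
codec_tokens = 50 * 30          # 50 tokens/sec codec -> 1,500 tokens
raw_samples = 24_000 * 30       # 24 kHz raw audio   -> 720,000 samples

# attention-matrix entries and float32 memory (4 bytes/entry)
codec_gb = codec_tokens**2 * 4 / 1e9   # ~0.009 GB: fits easily
raw_gb = raw_samples**2 * 4 / 1e9      # ~2,000+ GB: infeasible
```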
Efficient Attention Mechanisms
Causal (Autoregressive) Masking
For autoregressive generation, mask future positions:

$$S_{ij} = \begin{cases} S_{ij} & j \le i \\ -\infty & j > i \end{cases}$$

Implemented by setting masked positions to $-\infty$ before softmax, so they receive zero weight.
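A NumPy sketch of causal masking using an upper-triangular boolean mask:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked attention: position i may only attend to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # -inf before the softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(3)
Q = K = V = rng.normal(size=(5, 8))
out, A = causal_attention(Q, K, V)
# attention weights strictly above the diagonal are exactly zero
```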
Sliding Window (Local) Attention
Restrict attention to a local window of size $w$: position $i$ attends only to positions $j$ with $i - w < j \le i$.
Complexity: $O(n \cdot w \cdot d)$ — linear in sequence length.
Used in Mistral and some audio models for efficiency.
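A sketch of the corresponding boolean mask (causal variant, attending to the previous $w$ positions including the current one):

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: i attends to j with i - w < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(8, 3)
# row 5 attends to positions 3, 4, 5 only
```

Each row has at most $w$ allowed positions, which is what makes the cost $O(n \cdot w)$ instead of $O(n^2)$.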
Sparse Attention Patterns
Combine local and strided attention: each position attends to its immediate neighborhood plus a strided subset of earlier positions (e.g. every $\ell$-th token).
Jukebox uses this pattern. With stride $\ell \approx \sqrt{n}$, complexity is $O(n\sqrt{n} \cdot d)$.
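A sketch of a causal local-plus-strided mask (the exact pattern varies by model; this is the generic combination, not Jukebox's specific configuration):

```python
import numpy as np

def sparse_mask(n, window, stride):
    """Each position attends to the previous `window` positions
    plus every `stride`-th earlier position (causal)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (j > i - window)       # sliding-window component
    strided = causal & (j % stride == 0)    # strided component
    return local | strided

mask = sparse_mask(16, 4, 4)
# attended positions per row grow like window + n/stride, not n
```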
Flash Attention
An IO-aware implementation that computes exact attention without materializing the full matrix:
- Tiles the computation to fit in SRAM
- Fuses softmax, masking, and matrix multiply
- 2–4× speedup with identical numerical results
- Standard in modern frameworks (PyTorch, JAX)
Not an approximation — same result, just faster.
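The core algorithmic trick — an online softmax computed over tiles, with running max and running sum per query — can be sketched in NumPy. The real implementation adds SRAM tiling and kernel fusion, which NumPy cannot express, but the tiled result matches ordinary attention exactly:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Online-softmax sketch: process K/V in blocks so the full n x n
    score matrix is never materialized; rescale partial results as the
    running row-max is updated."""
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d_k)                  # scores for this tile only
        new_max = np.maximum(running_max, S.max(axis=-1))
        correction = np.exp(running_max - new_max)   # rescale earlier partials
        P = np.exp(S - new_max[:, None])
        running_sum = running_sum * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vb
        running_max = new_max
    return out / running_sum[:, None]

rng = np.random.default_rng(4)
Q, K = rng.normal(size=(6, 8)), rng.normal(size=(10, 8))
V = rng.normal(size=(10, 4))
tiled = tiled_attention(Q, K, V, block=3)

# reference: ordinary softmax attention with the full matrix materialized
S = Q @ K.T / np.sqrt(8)
e = np.exp(S - S.max(axis=-1, keepdims=True))
reference = (e / e.sum(axis=-1, keepdims=True)) @ V
```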
Linear Attention
Replace the softmax with a kernel feature map and reassociate the matrix product:

$$\mathrm{softmax}(QK^\top)V \approx \phi(Q)\left(\phi(K)^\top V\right)$$

where $\phi$ is a feature map. Computing $\phi(K)^\top V$ first costs $O(n \cdot d^2)$ — linear in $n$.
Examples: Performer, Random Feature Attention. Quality trade-offs exist.
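A sketch using the $\phi(x) = \mathrm{elu}(x) + 1$ feature map used by linear-transformer variants; the essential point is the order of operations — $\phi(K)^\top V$ is a small $d \times d_v$ matrix built once:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: O(n * d^2) instead of O(n^2 * d)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always > 0
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                     # (d, d_v): built once, cost O(n d d_v)
    norm = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / norm[:, None]

rng = np.random.default_rng(5)
Q, K = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
V = rng.normal(size=(100, 8))
out = linear_attention(Q, K, V)
```

A quick sanity check: the implicit attention weights still sum to 1, so attending over all-ones values returns all ones.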
Positional Encoding
Attention is permutation-invariant by default. Positional information must be added explicitly.
Sinusoidal Encoding (Original Transformer)

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
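The sinusoidal scheme is a few lines of NumPy: even channels get sine, odd channels get cosine, with geometrically spaced wavelengths:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Sinusoidal positional encodings from the original transformer."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even channels
    pe[:, 1::2] = np.cos(angles)                   # odd channels
    return pe

pe = sinusoidal_encoding(100, 64)
```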
Rotary Position Embeddings (RoPE)
Encode relative position by rotating query and key vectors:

$$q'_m = R_{\Theta, m}\, q_m, \qquad k'_n = R_{\Theta, n}\, k_n$$

where $R_{\Theta, m}$ is a block-diagonal rotation matrix whose angles scale with position $m$. The dot product $\langle q'_m, k'_n \rangle$ then depends only on the relative position $m - n$.
RoPE is now dominant in language models and increasingly used in audio transformers.
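A sketch of the pairwise-rotation view of RoPE: consecutive channel pairs are treated as 2-D points and rotated by position-proportional angles. The last lines check the relative-position property numerically (positions 3→7 and 10→14 have the same offset, so the dot products match):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate each consecutive channel pair of x by position * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = positions[:, None] * freqs[None, :]    # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(6)
q, k = rng.normal(size=8), rng.normal(size=8)
pos = np.arange(20, dtype=float)
# same relative offset (4) at different absolute positions
dot_a = rope_rotate(q[None], pos[3:4]) @ rope_rotate(k[None], pos[7:8]).T
dot_b = rope_rotate(q[None], pos[10:11]) @ rope_rotate(k[None], pos[14:15]).T
```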
Relative Position Bias
Add a learned or computed bias based on relative position:

$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + B\right), \qquad B_{ij} = b_{i-j}$$

Used in T5 and some audio models. ALiBi is a linear variant:

$$B_{ij} = -m \cdot (i - j)$$

where $m$ is a head-specific slope.
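A sketch of the ALiBi bias tensor, using the geometric slope schedule from the ALiBi paper (which applies as written to power-of-two head counts):

```python
import numpy as np

def alibi_bias(n, num_heads):
    """Per-head linear distance penalty -m * (i - j) for causal attention."""
    # slopes 2^(-8/h), 2^(-16/h), ...: halving per head when h is a power of 2
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = (i - j).clip(min=0)              # only past positions are penalized
    return -slopes[:, None, None] * dist    # (num_heads, n, n)

bias = alibi_bias(6, 8)
# added to the scaled scores before softmax; zero on the diagonal, <= 0 elsewhere
```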
Attention in Audio: Practical Considerations
| Design Choice | Recommendation for Audio |
|---|---|
| Positional encoding | RoPE or relative bias |
| Attention type | Causal for AR models, bidirectional for encoders |
| Number of heads | 8–32 (scale with model size) |
| Head dimension | 64–128 |
| Flash Attention | Always enable when available |
| Window size | At least 2–4 seconds of audio tokens |
The Attention Map as a Musical Analysis Tool
Attention patterns in trained audio models reveal learned musical structure:
- Diagonal patterns: local temporal dependencies
- Vertical stripes: globally important tokens (structural boundaries)
- Periodic patterns: rhythmic/metric structure (every 4 bars, every beat)
- Block patterns: section-level attention (verse attending to verse)
Visualizing attention maps is a valuable debugging and interpretability tool for music AI researchers.