Attention Mathematics

Attention mechanisms are the computational foundation of transformers, and transformers are the dominant architecture in modern music AI. This page covers the mathematics of attention in depth.

Scaled Dot-Product Attention

The basic attention operation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where:

  • Q \in \mathbb{R}^{n \times d_k} — queries
  • K \in \mathbb{R}^{m \times d_k} — keys
  • V \in \mathbb{R}^{m \times d_v} — values
  • d_k — key dimension (scaling factor)

Step by Step

1. Compute attention scores:

S = QK^\top \in \mathbb{R}^{n \times m}

Each entry S_{ij} measures the relevance of key j to query i.

2. Scale:

S' = \frac{S}{\sqrt{d_k}}

Scaling prevents softmax saturation. Without it, large d_k produces extreme dot products, causing softmax to output near-one-hot distributions with vanishing gradients.

3. Softmax normalization:

A_{ij} = \frac{\exp(S'_{ij})}{\sum_{j'=1}^{m} \exp(S'_{ij'})}

Each row of A sums to 1, forming a probability distribution over keys.

4. Weighted combination:

\text{Output}_i = \sum_{j=1}^{m} A_{ij} V_j

Each output is a weighted mixture of values, with weights determined by query-key similarity.
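The four steps above can be sketched in a few lines of NumPy (a minimal illustration with the row-max subtraction for numerical stability, not an optimized implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.  Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)            # step 1 + 2: scores, then scale; (n, m)
    S -= S.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)    # step 3: each row sums to 1
    return A @ V, A                       # step 4: output (n, d_v), weights (n, m)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
out, A = scaled_dot_product_attention(Q, K, V)
```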

Multi-Head Attention

Instead of a single attention function, use hh parallel heads with different learned projections:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O

where W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, and W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}.

Typically d_k = d_v = d_{\text{model}} / h.
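A minimal self-attention sketch of these formulas (an explicit per-head loop for readability; production code batches the heads into one einsum or reshape):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q/W_K/W_V: (h, d_model, d_k); W_O: (h*d_k, d_model). Self-attention."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                          # each head: (T, d_k)
    return np.concatenate(heads, axis=-1) @ W_O      # concat then project back

rng = np.random.default_rng(1)
d_model, h = 64, 8
d_k = d_model // h                    # typical choice: d_k = d_v = d_model / h
X = rng.normal(size=(10, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```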

Why Multiple Heads?

In music contexts, different heads learn to attend to different aspects:

  • Head A: rhythmic patterns (attend to same beat positions)
  • Head B: harmonic relationships (attend to consonant pitch patterns)
  • Head C: local context (attend to nearby tokens)
  • Head D: structural repeats (attend to similar sections)

Self-Attention vs. Cross-Attention

Self-Attention

Q, K, V all come from the same sequence:

Q = XW^Q, \quad K = XW^K, \quad V = XW^V

Used for modeling dependencies within the audio token sequence.

Cross-Attention

Q comes from one sequence (audio), K and V from another (text):

Q = X_{\text{audio}}W^Q, \quad K = X_{\text{text}}W^K, \quad V = X_{\text{text}}W^V

Cross-attention is how text conditioning is injected into audio generation. Each audio token "looks at" the text embeddings to decide what to generate.
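The shapes make the conditioning mechanism concrete: the output keeps the audio sequence length, while each attention row is a distribution over the text tokens (a minimal sketch; token counts and dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 32
X_audio = rng.normal(size=(100, d))   # 100 audio tokens (queries)
X_text = rng.normal(size=(12, d))     # 12 text tokens (keys and values)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X_audio @ W_Q, X_text @ W_K, X_text @ W_V
A = softmax(Q @ K.T / np.sqrt(d))     # (100, 12): each audio token attends over text
out = A @ V                           # (100, 32): output has the audio length
```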

Computational Complexity

Standard self-attention has complexity:

O(T^2 \cdot d)

where T is sequence length and d is model dimension.

For music at 50 tokens/sec, a 30-second clip has T = 1500 tokens. The attention matrix has 1500^2 = 2.25M entries — manageable.

For raw audio at 24 kHz, a 30-second clip has T = 720{,}000 tokens. Standard attention is infeasible.
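The arithmetic behind those two numbers:

```python
# Tokenized music: 50 tokens/sec for 30 seconds
T_tokens = 50 * 30                 # 1500 tokens
entries_tokens = T_tokens ** 2     # attention-matrix entries: 2.25 million

# Raw audio: 24 kHz sample rate for 30 seconds
T_raw = 24_000 * 30                # 720,000 "tokens"
entries_raw = T_raw ** 2           # over 5e11 entries: infeasible to materialize
```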

Efficient Attention Mechanisms

Causal (Autoregressive) Masking

For autoregressive generation, mask future positions:

A_{ij} = \begin{cases} \text{softmax}(S'_{ij}) & j \leq i \\ 0 & j > i \end{cases}

Implemented by setting masked positions to -\infty before softmax.
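A sketch of that implementation trick: set entries above the diagonal to -inf, then softmax maps them to exactly zero:

```python
import numpy as np

def causal_attention_weights(S):
    """Causal softmax over scaled scores S of shape (T, T)."""
    T = S.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    S = np.where(future, -np.inf, S)                    # mask before softmax
    e = np.exp(S - S.max(axis=-1, keepdims=True))       # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
A = causal_attention_weights(rng.normal(size=(5, 5)))
```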

Sliding Window (Local) Attention

Restrict attention to a local window of size ww:

A_{ij} = 0 \quad \text{if} \quad |i - j| > w/2

Complexity: O(T \cdot w \cdot d) — linear in sequence length.

Used in Mistral and some audio models for efficiency.
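Building the boolean window mask is straightforward (a sketch; real implementations avoid materializing even this mask for long sequences):

```python
import numpy as np

def sliding_window_mask(T, w):
    """True where attention is allowed: |i - j| <= w // 2."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

M = sliding_window_mask(T=8, w=4)   # each position sees itself +/- 2 neighbors
```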

Sparse Attention Patterns

Combine local and strided attention:

\mathcal{N}(i) = \{j : |i-j| \leq w\} \cup \{j : j \bmod s = 0\}

Jukebox uses this pattern. Complexity: O(T \sqrt{T}).
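The allowed set \mathcal{N}(i) can be computed directly from the definition (illustrative parameter values; Jukebox's actual window and stride differ):

```python
def sparse_neighborhood(i, T, w, s):
    """Positions query i may attend to: local window union strided positions."""
    local = {j for j in range(T) if abs(i - j) <= w}
    strided = {j for j in range(T) if j % s == 0}
    return local | strided

N = sparse_neighborhood(i=10, T=32, w=2, s=8)
```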

Flash Attention

An IO-aware implementation that computes exact attention without materializing the full T \times T matrix:

  • Tiles the computation to fit in SRAM
  • Fuses softmax, masking, and matrix multiply
  • 2–4× speedup with identical numerical results
  • Standard in modern frameworks (PyTorch, JAX)

Not an approximation — same result, just faster.
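The core idea is an online softmax over key/value tiles: keep a running row maximum and denominator, rescale the accumulated output when the maximum changes, and never build the full score matrix. A NumPy sketch of the algorithm (illustrating the math only; the real speedup comes from SRAM tiling and kernel fusion on GPU):

```python
import numpy as np

def tiled_attention(Q, K, V, block=32):
    """Exact attention via online softmax over key/value tiles."""
    n, d_k = Q.shape
    m, d_v = V.shape
    out = np.zeros((n, d_v))
    row_max = np.full(n, -np.inf)    # running max of scores per query row
    row_sum = np.zeros(n)            # running softmax denominator per row
    for start in range(0, m, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d_k)                  # scores for this tile only
        new_max = np.maximum(row_max, S.max(axis=-1))
        scale = np.exp(row_max - new_max)            # rescale old accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=-1)
        out = out * scale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(4)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
tiled = tiled_attention(Q, K, V)

# Reference: standard attention with the full score matrix
S = Q @ K.T / np.sqrt(8)
A = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (A / A.sum(axis=-1, keepdims=True)) @ V
```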

Linear Attention

Replace softmax with a kernel approximation:

\text{Attention}(Q, K, V) \approx \phi(Q)(\phi(K)^\top V)

where \phi is a feature map. Complexity: O(T \cdot d^2) — linear in T.

Examples: Performer, Random Feature Attention. Quality trade-offs exist.
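A sketch using the elu+1 feature map from the linear-transformers line of work (Performer instead uses random features to approximate the softmax kernel). The key point is that \phi(K)^\top V is a (d \times d_v) matrix computed once, independent of sequence length:

```python
import numpy as np

def elu_plus_one(x):
    """A simple positive feature map phi; one of several choices in the literature."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """phi(Q) (phi(K)^T V), normalized per row: O(T * d^2) instead of O(T^2 * d)."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    KV = Kp.T @ V                     # (d, d_v): shared across all queries
    Z = Qp @ Kp.sum(axis=0)           # per-row normalizer (replaces the softmax sum)
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(5)
Q, K, V = rng.normal(size=(50, 8)), rng.normal(size=(50, 8)), rng.normal(size=(50, 4))
out = linear_attention(Q, K, V)
```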

Positional Encoding

Attention is permutation-invariant by default. Positional information must be added explicitly.

Sinusoidal Encoding (Original Transformer)

PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)

PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)
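These formulas translate directly to a vectorized table (a sketch assuming even d):

```python
import numpy as np

def sinusoidal_encoding(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(T)[:, None]                  # (T, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))        # (T, d/2)
    PE = np.zeros((T, d))
    PE[:, 0::2] = np.sin(angles)                 # even dimensions
    PE[:, 1::2] = np.cos(angles)                 # odd dimensions
    return PE

PE = sinusoidal_encoding(T=100, d=64)
```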

Rotary Position Embeddings (RoPE)

Encode relative position by rotating query and key vectors:

\tilde{q}_m = R_\Theta^m q_m, \quad \tilde{k}_n = R_\Theta^n k_n

where R_\Theta^m is a rotation matrix. The dot product \tilde{q}_m^\top \tilde{k}_n depends only on the relative position (m - n).

RoPE is now dominant in language models and increasingly used in audio transformers.
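A sketch of the rotation applied to consecutive dimension pairs, verifying the relative-position property numerically: shifting both positions by the same offset leaves the score unchanged.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (shape (d,)) by angle pos * theta_i."""
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    ang = pos * theta
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s       # 2D rotation of each pair
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(6)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rope_rotate(q, 5) @ rope_rotate(k, 2)       # relative offset 3
b = rope_rotate(q, 105) @ rope_rotate(k, 102)   # same offset, shifted by 100
```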

Relative Position Bias

Add a learned or computed bias based on relative position:

S_{ij} = q_i^\top k_j + b_{i-j}

Used in T5 and some audio models. ALiBi is a linear variant:

S_{ij} = q_i^\top k_j - m \cdot |i - j|

where m is a head-specific slope.
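The ALiBi bias matrix can be precomputed once per head (a sketch; the geometric slope schedule 2^{-8h/H} follows the original ALiBi recipe for power-of-two head counts):

```python
import numpy as np

def alibi_bias(T, num_heads):
    """Per-head bias -m * |i - j|, shape (num_heads, T, T), added to scores."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])  # |i - j|
    return -slopes[:, None, None] * dist

B = alibi_bias(T=6, num_heads=4)
```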

Attention in Audio: Practical Considerations

Recommendations by design choice:

  • Positional encoding: RoPE or relative bias
  • Attention type: causal for AR models, bidirectional for encoders
  • Number of heads: 8–32 (scale with model size)
  • Head dimension: 64–128
  • Flash Attention: always enable when available
  • Window size: at least 2–4 seconds of audio tokens

The Attention Map as a Musical Analysis Tool

Attention patterns in trained audio models reveal learned musical structure:

  • Diagonal patterns: local temporal dependencies
  • Vertical stripes: globally important tokens (structural boundaries)
  • Periodic patterns: rhythmic/metric structure (every 4 bars, every beat)
  • Block patterns: section-level attention (verse attending to verse)

Visualizing attention maps is a valuable debugging and interpretability tool for music AI researchers.