MusicGen

MusicGen is Meta's open-source text-to-music model, released in June 2023. It introduced a simple yet effective single-stage transformer architecture that generates music from text prompts or melody conditioning.

Architecture

Unlike MusicLM's cascaded approach, MusicGen uses a single autoregressive transformer that models multiple codebook streams simultaneously.

Text Prompt ──▶ T5 Encoder ──▶ Conditioning
                                    │
                                    ▼
                      ┌──────────────────────────┐
                      │      Autoregressive      │
                      │       Transformer        │──▶ EnCodec Tokens ──▶ Waveform
                      │      (multi-stream)      │
                      └──────────────────────────┘

EnCodec Tokenization

Audio is compressed into discrete tokens using Meta's EnCodec codec with residual vector quantization (RVQ). At 32 kHz, this produces 4 codebook streams, each at 50 tokens per second.

Each timestep has $Q = 4$ parallel codebook tokens:

$$\mathbf{c}_t = (c_t^1, c_t^2, c_t^3, c_t^4)$$
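The figures above translate into a concrete token budget. The following is a small arithmetic sketch, not library code; the constant and function names are illustrative, and the values come from the text (32 kHz audio, 50 frames per second, 4 codebooks).

```python
SAMPLE_RATE_HZ = 32_000   # audio sample rate (from the text)
FRAME_RATE_HZ = 50        # token frames per second per codebook (from the text)
NUM_CODEBOOKS = 4         # Q, the number of RVQ codebook streams (from the text)

# Each token frame summarizes this many raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def token_budget(seconds):
    """Return (timesteps, total_tokens) for a clip of the given length."""
    timesteps = round(seconds * FRAME_RATE_HZ)
    return timesteps, timesteps * NUM_CODEBOOKS


timesteps, total_tokens = token_budget(30.0)
print(samples_per_frame, timesteps, total_tokens)  # 640 1500 6000
```

A 30-second clip is therefore 1,500 timesteps of 4 tokens each, which is the sequence the interleaving patterns below have to arrange.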

Codebook Interleaving Patterns

The key innovation in MusicGen is how it handles multi-codebook generation. Three patterns were explored:

1. Flat Pattern (Baseline)

Flatten all codebooks into a single sequence. This is simple, but the sequence is 4× longer, so generation needs 4× as many autoregressive steps:

$$c_1^1, c_1^2, c_1^3, c_1^4, c_2^1, c_2^2, c_2^3, c_2^4, \dots$$
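As a minimal sketch of the flat ordering, assuming the codes are held as one Python list per codebook (`codes[q][t]`), flattening and its inverse can be written as follows. The helper names are hypothetical and not taken from any library.

```python
def flatten_codes(codes):
    """codes[q][t] -> flat list ordered timestep by timestep, codebook by codebook."""
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    return [codes[q][t] for t in range(num_steps) for q in range(num_codebooks)]


def unflatten_codes(flat, num_codebooks):
    """Inverse of flatten_codes: every num_codebooks-th token belongs to one stream."""
    return [flat[q::num_codebooks] for q in range(num_codebooks)]


# Label each token "c<t>^<q>" so the ordering is visible (3 timesteps, 4 codebooks).
codes = [[f"c{t}^{q}" for t in range(1, 4)] for q in range(1, 5)]
flat = flatten_codes(codes)
print(flat[:5])   # ['c1^1', 'c1^2', 'c1^3', 'c1^4', 'c2^1']
print(len(flat))  # 12
```

The flat sequence has `T * Q` entries, which is where the 4× cost comes from when `Q = 4`.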

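2. Parallel Pattern

Predict all $Q$ codebooks of a timestep simultaneously, with no offset between the streams:

$$p(c_t^q \mid c_{<t}^{1:Q}, \mathbf{c}_{\text{text}})$$

This needs only one decoding step per timestep, but the codebooks at step $t$ are predicted independently of one another, so a finer codebook cannot condition on the coarser codes of the same timestep.

3. Delay Pattern (Best)

A compromise: codebook $q$ is delayed by $q - 1$ steps, so each stream is shifted by one step relative to the previous one, creating a staircase pattern. All streams are still generated in a single pass, and because the coarser codebooks of a timestep are emitted earlier, the model can condition on them when generating the finer ones.

The delay pattern achieves the best quality/speed trade-off and is the default.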
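The staircase layout can be illustrated with a short sketch. It assumes codes are stored as `codes[q][t]` with a 0-based codebook index (so stream `q` is shifted by `q` steps) and uses `None` as a stand-in for the special padding token; the function names and the padding choice are assumptions for illustration, not the audiocraft implementation.

```python
PAD = None  # hypothetical placeholder for positions with no real token


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook q (0-based) right by q steps; result has T + Q - 1 columns."""
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    total_steps = num_steps + num_codebooks - 1
    delayed = []
    for q, stream in enumerate(codes):
        right_pad = total_steps - num_steps - q
        delayed.append([pad] * q + list(stream) + [pad] * right_pad)
    return delayed


def undo_delay_pattern(delayed):
    """Recover the aligned codes[q][t] layout from the staircase layout."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - num_codebooks + 1
    return [stream[q:q + num_steps] for q, stream in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # Q = 4, T = 3
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)

flat_steps = len(codes) * len(codes[0])   # decoding steps for the flat pattern
delay_steps = len(delayed[0])             # decoding steps for the delay pattern
```

Each column of `delayed` is what one decoding step emits. For this toy input the delay pattern needs `T + Q - 1 = 6` steps, compared with `T * Q = 12` for the flat pattern.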

Text Conditioning

Text is encoded using a frozen T5 encoder. Cross-attention layers inject text embeddings into the transformer:

$$\text{CrossAttn}(Q_{\text{audio}}, K_{\text{text}}, V_{\text{text}})$$
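To make the data flow concrete, here is a deliberately simplified single-head cross-attention in plain Python. It assumes scaled dot-product attention and omits the learned projections, multiple heads, and batching that a real transformer layer would have; all names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries: [T_audio][d]; keys: [T_text][d]; values: [T_text][d_v].

    Returns one d_v-dimensional vector per audio position, each a weighted
    average of the text value vectors.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Three audio positions attend over two text-token embeddings.
audio_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_k = [[1.0, 0.0], [0.0, 1.0]]
text_v = [[10.0, 0.0], [0.0, 10.0]]
mixed = cross_attention(audio_q, text_k, text_v)
```

The queries come from the audio token stream while the keys and values come from the text encoder output, so the output length follows the audio sequence, not the prompt.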

Classifier-free guidance is used during inference:

$$\hat{\mathbf{l}} = (1 + w) \cdot \mathbf{l}_{\text{cond}} - w \cdot \mathbf{l}_{\text{uncond}}$$
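The guidance formula is a per-element combination of two logit vectors. A minimal sketch, assuming the logits are plain Python lists and with a hypothetical function name:

```python
def guided_logits(cond, uncond, w):
    """Classifier-free guidance: (1 + w) * cond - w * uncond, element-wise."""
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]


cond = [2.0, 0.5]     # logits computed with the text prompt
uncond = [1.0, 1.0]   # logits computed with the prompt dropped
print(guided_logits(cond, uncond, 2.0))  # [4.0, -0.5]
print(guided_logits(cond, uncond, 0.0))  # [2.0, 0.5]
```

With `w = 0` the result is just the conditional logits; larger `w` pushes the distribution further in the direction in which the prompt changed it. The same expression can be written as `uncond + (1 + w) * (cond - uncond)`.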

Model Variants

| Variant | Parameters | Quality | Speed |
|---|---|---|---|
| MusicGen-Small | 300M | Good | Fast |
| MusicGen-Medium | 1.5B | Better | Moderate |
| MusicGen-Large | 3.3B | Best | Slower |
| MusicGen-Melody | 1.5B | + melody cond. | Moderate |

Melody Conditioning

MusicGen-Melody accepts a reference audio melody in addition to (or instead of) text. The melody is encoded using a chromagram representation:

$$\mathbf{m}_t = \text{Chroma}(x_{\text{ref}}, t) \in \mathbb{R}^{12}$$

The 12-dimensional chroma vector captures pitch class distribution, allowing the model to follow a melody while generating new instrumentation and arrangement.
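The idea of folding pitches into 12 classes can be sketched without any audio library. The example below assumes the reference frame has already been reduced to a list of `(frequency_hz, magnitude)` spectral peaks and uses equal temperament with A4 = 440 Hz; a real chromagram is computed from a spectrogram, so this is only an illustration of the octave-folding step, with hypothetical function names.

```python
import math


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class 0..11 (0 = C, 9 = A), ignoring octave."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_vector(peaks):
    """peaks: iterable of (frequency_hz, magnitude) -> normalized 12-dim chroma."""
    chroma = [0.0] * 12
    for freq_hz, magnitude in peaks:
        chroma[pitch_class(freq_hz)] += magnitude
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma


# A4 and A5 land in the same bin; C4 lands in bin 0.
frame = [(440.0, 1.0), (880.0, 1.0), (261.63, 2.0)]
m_t = chroma_vector(frame)
```

Because octaves collapse into the same bin, the vector keeps which notes are sounding but discards register and timbre, which is what leaves the model free to choose its own instrumentation.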

Training Details

  • Data: about 20,000 hours of licensed music in total, drawn from an internal set of 10,000 high-quality tracks plus the Shutterstock and Pond5 music collections
  • Sample rate: 32 kHz mono
  • Segment length: 30 seconds during training
  • Optimizer: AdamW, $\beta_1 = 0.9$, $\beta_2 = 0.95$
  • Learning rate: warmup + cosine decay
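A warmup-plus-cosine schedule like the one listed above can be sketched as a pure function of the step count. The peak learning rate, warmup length, and total step count used as defaults here are illustrative assumptions, not values stated in this document, and the function name is hypothetical.

```python
import math


def learning_rate(step, peak_lr=1e-4, warmup_steps=4000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at zero past the end of training
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 2000, 4000, 502_000, 1_000_000):
    print(step, learning_rate(step))
```

In a PyTorch training loop, the optimizer settings from the list would correspond to `torch.optim.AdamW(params, lr=..., betas=(0.9, 0.95))`, with the value above assigned to the learning rate at each step.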

Evaluation

MusicGen outperformed MusicLM on FAD and human preference scores at the time of release:

| Metric | MusicGen-Large | MusicLM |
|---|---|---|
| FAD (VGGish) | 3.80 | 4.00 |
| KL Divergence | 1.22 | 1.29 |
| Human Overall Quality | Preferred | Baseline |
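For readers unfamiliar with the first metric: Fréchet Audio Distance (FAD) is usually defined as the Fréchet distance between two Gaussians fitted to embeddings (here VGGish) of reference audio and generated audio. The symbols below are introduced for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of the reference embeddings and $\mu_g, \Sigma_g$ those of the generated ones.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated audio's embedding statistics are closer to the reference set. KL divergence is likewise a lower-is-better score.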

Open Source

MusicGen's code and weights are publicly released (the code under the MIT license, the model weights under CC-BY-NC 4.0):

  • Model weights available on Hugging Face
  • Training code in Meta's audiocraft repository
  • Inference supports CPU, GPU, and Apple Silicon
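As a minimal usage sketch, the released weights can be driven through the `audiocraft` package roughly as follows. The wrapper function, its defaults, and the output file stem are hypothetical; the imports are placed inside the function so that it only requires `audiocraft` (and a model download) when actually called.

```python
def generate_clip(prompt, seconds=8.0, out_stem="musicgen_demo"):
    """Generate one clip from a text prompt and write it to '<out_stem>.wav'."""
    # Lazy imports: calling this function needs `pip install audiocraft`.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-small")
    model.set_generation_params(duration=seconds)
    wav = model.generate([prompt])  # tensor shaped [batch, channels, samples]
    # audio_write adds the file extension itself and returns the written path.
    return audio_write(out_stem, wav[0].cpu(), model.sample_rate,
                       strategy="loudness")


# Example call (downloads the checkpoint on first use):
# generate_clip("lo-fi hip hop beat with mellow piano")
```

The small checkpoint is used here because it is the quickest to download and run; the other variants in the table above are selected by changing the name passed to `get_pretrained`.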

This openness made MusicGen the de facto baseline for text-to-music research.

Engineering Significance

MusicGen demonstrated that:

  1. A single transformer can handle multi-codebook generation efficiently
  2. Clever interleaving patterns eliminate the need for cascaded architectures
  3. Pre-trained text encoders (T5) transfer effectively to music conditioning
  4. Open-weight music models can achieve competitive quality