
Controllable Generation

Controllable generation enables fine-grained steering of AI music output beyond simple text prompts. This page covers the techniques that give users precise control over musical attributes like pitch, dynamics, timing, and style.

The Control Spectrum

Control methods range from coarse to fine-grained:

Coarse ◀────────────────────────────────────────▶ Fine

Text prompt → Style transfer → Audio conditioning → MIDI input → Per-frame controls

| Control Level | Example | Precision |
|---|---|---|
| Text prompt | "upbeat jazz" | Low |
| Reference audio | "like this song" | Medium |
| Melody conditioning | Hum + generate | Medium-high |
| Chord conditioning | I–V–vi–IV | High |
| MIDI control | Note-by-note | Very high |
| Frame-level control | Energy, pitch per frame | Maximum |

Text-Based Control

Prompt Engineering

The baseline control method. Quality depends on:

  • Vocabulary alignment with training data
  • Specificity of descriptors
  • Consistency of prompt elements

See the Prompt Engineering Guide and Genre-Specific Prompting for detailed strategies.

Classifier-Free Guidance (CFG)

Control the strength of text conditioning during inference:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \cdot \big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$

| Guidance Scale $w$ | Effect |
|---|---|
| 0 | Unconditioned (ignores prompt) |
| 1–3 | Balanced (natural, diverse) |
| 5–7 | Strong adherence (less diverse) |
| 10+ | Very literal (may reduce quality) |
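The CFG combination is a one-liner; a minimal numpy sketch, using the convention (matching the table) where $w = 0$ is unconditional and $w = 1$ is purely conditional. The toy vectors stand in for two denoiser passes:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one by scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions standing in for the two denoiser passes.
eps_uncond = np.array([0.2, 0.1])
eps_cond = np.array([1.0, 0.5])

cfg_combine(eps_uncond, eps_cond, 0.0)  # ignores the prompt entirely
cfg_combine(eps_uncond, eps_cond, 3.0)  # balanced-to-strong adherence
```

In practice this means two forward passes per denoising step, one with the text embedding and one with a null embedding, which roughly doubles inference cost.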

Negative Prompts

Specify what to avoid:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t, c_{\text{pos}}) + w \cdot \big(\epsilon_\theta(x_t, t, c_{\text{pos}}) - \epsilon_\theta(x_t, t, c_{\text{neg}})\big)$$

Example: "jazz piano trio" (positive) + "electronic, synthesizer, drums" (negative) steers away from unwanted elements.
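The same extrapolation trick works with a negative prompt in place of the empty condition; a sketch of the formula above, with toy vectors standing in for the two denoiser predictions:

```python
import numpy as np

def negative_guidance(eps_pos, eps_neg, w):
    """Push the prediction away from the negative prompt by
    extrapolating past the positive prediction."""
    return eps_pos + w * (eps_pos - eps_neg)

eps_pos = np.array([1.0, 0.5])   # e.g. conditioned on "jazz piano trio"
eps_neg = np.array([0.4, 0.9])   # e.g. "electronic, synthesizer, drums"
negative_guidance(eps_pos, eps_neg, 2.0)
```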

Audio Conditioning

Style Transfer

Use a reference audio clip to guide the generation:

$$\mathbf{c}_{\text{style}} = E_{\text{audio}}(x_{\text{ref}})$$

The style embedding captures timbre, production quality, and overall aesthetic without copying specific notes.

Melody Conditioning

Provide a melody (humming, whistling, or MIDI) that the model follows:

$$\mathbf{c}_{\text{melody}} = \text{Chroma}(x_{\text{melody}}) \in \mathbb{R}^{12 \times T}$$

The chromagram captures pitch class over time, allowing the model to match the melody while generating its own arrangement.

MusicGen-Melody uses this approach with a reference audio input.
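A minimal chromagram sketch in plain numpy (production conditioners such as MusicGen-Melody's use tuned filterbanks and further quantization; this simply folds STFT magnitude bins onto 12 pitch classes):

```python
import numpy as np

def chromagram(x, sr, n_fft=2048, hop=512):
    """Fold STFT magnitude bins onto 12 pitch classes (MIDI note mod 12).
    Returns an array of shape (12, T)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    valid = freqs > 30.0  # ignore DC and sub-audio bins
    pitch_class = np.round(69 + 12 * np.log2(freqs[valid] / 440.0)).astype(int) % 12
    chroma = np.zeros((12, n_frames))
    for t in range(n_frames):
        mag = np.abs(np.fft.rfft(win * x[t * hop : t * hop + n_fft]))[valid]
        np.add.at(chroma[:, t], pitch_class, mag)  # accumulate per pitch class
    return chroma

# A 440 Hz sine should light up pitch class 9 (A).
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
```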

Audio Inpainting

Regenerate only specific time segments while keeping the rest:

$$x_{\text{output}} = m \odot x_{\text{original}} + (1 - m) \odot G_\theta(\text{noise}, c)$$

where $m$ is a binary mask indicating which regions to keep.

In diffusion models, inpainting falls out naturally:

  1. Add noise to the region being regenerated
  2. Denoise while conditioning on the preserved context
  3. Blend at the boundaries
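The loop below sketches this RePaint-style procedure under toy assumptions: a linear noise schedule and a `denoise_step` callable standing in for the trained denoiser. As in the equation above, `mask` is 1 where the original is kept:

```python
import numpy as np

def inpaint(x_orig, mask, denoise_step, T=50, seed=0):
    """RePaint-style inpainting sketch. At each step, re-noise the
    known region of the original to the current noise level and
    impose it, so the model only invents content where mask == 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_orig.shape)
    for t in reversed(range(T)):
        sigma = t / T  # toy linear schedule
        known = x_orig + sigma * rng.standard_normal(x_orig.shape)
        x = mask * known + (1 - mask) * x
        x = denoise_step(x, t)
    # Final blend guarantees the kept region is bit-exact.
    return mask * x_orig + (1 - mask) * x
```

With a real diffusion model, `denoise_step` would be one reverse-diffusion update using the model's own schedule.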

Audio-to-Audio Translation

Transform existing audio while preserving structure:

$$x_{\text{out}} = G_\theta(x_{\text{in}}, c_{\text{style}})$$

  • Input: piano recording → Output: orchestral arrangement
  • Input: rough demo → Output: polished production
  • Input: electronic track → Output: acoustic version

Structural Control

Section Tags

Explicit structural markers in prompts:

[Intro] ambient pad, gentle piano
[Verse] add drums, bass enters, vocal melody
[Chorus] full energy, all instruments, anthemic
[Bridge] stripped back, just piano and vocal
[Outro] gradual fade, reverb tails

Temporal Conditioning

Provide per-section control signals:

$$\mathbf{c}(t) = \text{Interp}(\mathbf{c}_1, \mathbf{c}_2; t) \quad \text{for } t \in [t_1, t_2]$$

This allows smooth transitions between different conditioning states.
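A minimal sketch of such an interpolator, using linear blending (one plausible choice of Interp) and hypothetical "verse"/"chorus" conditioning vectors:

```python
import numpy as np

def temporal_conditioning(c1, c2, t, t1, t2):
    """Linearly interpolate between two conditioning vectors over
    [t1, t2], clamped outside the interval."""
    a = np.clip((t - t1) / (t2 - t1), 0.0, 1.0)
    return (1.0 - a) * c1 + a * c2

verse = np.array([1.0, 0.0])   # hypothetical "verse" conditioning
chorus = np.array([0.0, 1.0])  # hypothetical "chorus" conditioning
temporal_conditioning(verse, chorus, t=15.0, t1=10.0, t2=20.0)  # halfway blend
```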

Energy Curves

Specify an energy trajectory:

Energy
│    ╱‾‾╲      ╱‾‾‾╲
│   ╱    ╲    ╱     ╲
│  ╱      ╲  ╱       ╲
│ ╱        ╲╱         ╲___
└──────────────────────────── Time
  intro build drop break drop outro

Map to a numerical signal that conditions the model frame-by-frame.
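One simple way to do that mapping is to define the curve as (time, energy) breakpoints and interpolate to the model's frame rate; a sketch with an illustrative curve roughly matching the shape above:

```python
import numpy as np

def energy_signal(breakpoints, n_frames, duration):
    """Turn (time_s, energy) breakpoints into a per-frame control
    signal by linear interpolation."""
    times, energies = zip(*breakpoints)
    frame_times = np.linspace(0.0, duration, n_frames)
    return np.interp(frame_times, times, energies)

# Illustrative trajectory: build, drop, break, second drop, outro.
curve = [(0, 0.2), (15, 0.9), (25, 0.9), (35, 0.3), (50, 1.0), (60, 0.1)]
signal = energy_signal(curve, n_frames=600, duration=60.0)
```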

Musical Attribute Control

Pitch / Key Control

Force generation into a specific key:

$$\mathbf{c}_{\text{key}} = \text{one\_hot}(\text{key}) \in \{0,1\}^{24}$$

(12 pitch classes × 2 for major/minor)
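The 24-way encoding is trivial to construct; a sketch using the convention (an assumption) that minor keys occupy indices 12–23:

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def key_one_hot(tonic, mode):
    """24-dim key encoding: index = pitch class, offset by 12 for minor."""
    v = np.zeros(24)
    v[PITCH_CLASSES.index(tonic) + (12 if mode == "minor" else 0)] = 1.0
    return v

key_one_hot("A", "minor")  # single 1 at index 21 (9 + 12)
```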

Tempo Control

Condition on exact BPM:

$$\mathbf{c}_{\text{tempo}} = \text{embed}(\text{BPM}) \in \mathbb{R}^{d}$$

Some models support smooth tempo changes within a generation.
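One plausible choice for embed (an illustrative assumption, not a fixed standard) is a sinusoidal encoding of the scalar BPM, transformer-positional-encoding style:

```python
import numpy as np

def tempo_embedding(bpm, d=64, max_bpm=300.0):
    """Sinusoidal scalar embedding of BPM into a d-dim vector,
    with geometrically spaced frequencies."""
    i = np.arange(d // 2)
    freqs = 1.0 / (max_bpm ** (2.0 * i / d))
    angles = bpm * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

Nearby tempos get nearby codes, which is what makes smooth tempo ramps within a generation feasible to condition on.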

Instrument Control

Specify which instruments should be present:

$$\mathbf{c}_{\text{inst}} = \sum_{i \in \text{active}} \text{embed}(i)$$

Or use multi-hot encoding over an instrument vocabulary.
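A sketch of the multi-hot variant, over a toy vocabulary (real models use a much larger one):

```python
import numpy as np

# Toy instrument vocabulary for illustration.
VOCAB = ["piano", "bass", "drums", "guitar", "strings", "synth"]

def instrument_multi_hot(active):
    """Multi-hot encoding over the instrument vocabulary."""
    v = np.zeros(len(VOCAB))
    for name in active:
        v[VOCAB.index(name)] = 1.0
    return v

instrument_multi_hot(["piano", "bass", "drums"])  # three 1s, rest 0
```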

Dynamics Control

Per-frame loudness conditioning:

$$L(t) = \text{target\_loudness}(t) \quad \text{(LUFS)}$$

Allows specifying crescendos, drops, and dynamic arcs precisely.

MIDI-Level Control

MIDI-Conditioned Generation

Provide a MIDI file as input, generate audio output:

$$x = G_\theta(\text{MIDI}, c_{\text{style}})$$

This gives note-level control (pitch, timing, velocity) while the model handles:

  • Timbre (instrument sounds)
  • Production (effects, spatial positioning)
  • Expression (micro-timing, dynamics beyond MIDI velocity)

Advantages

  • Exact pitch and rhythm control
  • Full arrangement control
  • Combine with style conditioning for versatile output

Challenges

  • Requires MIDI input (not always available)
  • Expressiveness limited by the MIDI representation
  • Model must generalize across the MIDI → audio mapping

Latent Space Manipulation

Direct Latent Editing

Modify specific dimensions of the latent representation:

$$\mathbf{z}' = \mathbf{z} + \alpha \cdot \mathbf{d}_{\text{attribute}}$$

where $\mathbf{d}_{\text{attribute}}$ is a direction in latent space corresponding to a musical attribute.

Finding directions:

  • Linear probes: train a linear classifier on labeled data
  • Contrastive pairs: compute the difference between "with" and "without" attribute
  • PCA of attribute subspace: find the principal direction of variation
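The contrastive-pair approach can be sketched in a few lines; here synthetic latents (an illustrative setup) have the attribute shift dimension 0, and the recovered direction is used for an edit:

```python
import numpy as np

def attribute_direction(z_with, z_without):
    """Contrastive-pair direction: normalized mean difference between
    latents of examples with and without the attribute."""
    d = z_with.mean(axis=0) - z_without.mean(axis=0)
    return d / np.linalg.norm(d)

# Synthetic latents where the attribute shifts dimension 0 by +2.
rng = np.random.default_rng(0)
z_without = rng.standard_normal((100, 8))
z_with = z_without + np.array([2.0, 0, 0, 0, 0, 0, 0, 0])
d = attribute_direction(z_with, z_without)

z_edited = rng.standard_normal(8) + 1.5 * d  # push one latent along it
```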

Interpolation for Smooth Transitions

Blend between two musical states:

$$\mathbf{z}(t) = \text{slerp}(\mathbf{z}_A, \mathbf{z}_B; t)$$

Useful for:

  • Crossfading between styles
  • Gradual tempo/energy changes
  • Morphing between instruments
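A self-contained slerp implementation, with a lerp fallback for nearly parallel vectors:

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two latent vectors."""
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < eps:  # nearly parallel: plain lerp is fine
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

Unlike plain lerp, slerp keeps intermediate points at a sensible norm, which matters because high-dimensional Gaussian latents concentrate near a hypersphere.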

Control in Practice

Building a Control Interface

Effective control UIs for music generation:

| Control | UI Element | Mapping |
|---|---|---|
| Genre | Dropdown / tags | Text embedding |
| BPM | Slider (60–200) | Tempo conditioning |
| Energy | Curve editor | Frame-level energy |
| Key | Dropdown (C Major, etc.) | Key embedding |
| Instruments | Checkboxes | Multi-hot / text |
| Duration | Slider (5–300s) | Duration conditioning |
| Guidance | Slider (1–15) | CFG scale |
| Structure | Section editor | Temporally segmented prompts |
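A sketch of the glue between such a UI and a generator; the key names and dict shape are illustrative assumptions, not any specific model's API:

```python
def build_conditioning(ui_state):
    """Map hypothetical UI state to a conditioning dict for a
    downstream model call (field names are illustrative)."""
    return {
        "text": f'{ui_state["genre"]}, {", ".join(ui_state["instruments"])}',
        "bpm": float(ui_state["bpm"]),
        "key": ui_state["key"],
        "duration_s": float(ui_state["duration"]),
        "cfg_scale": float(ui_state["guidance"]),
    }

cond = build_conditioning({
    "genre": "upbeat jazz",
    "instruments": ["piano", "bass", "drums"],
    "bpm": 140,
    "key": "C major",
    "duration": 30,
    "guidance": 3.0,
})
```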

Control Hierarchy

For best results, layer controls from coarse to fine:

  1. Genre and style (text prompt) – broadest
  2. Tempo and key (numerical) – structural constraints
  3. Instruments (text/selection) – sonic palette
  4. Structure (section tags) – arrangement
  5. Energy curve (per-frame) – dynamics
  6. Melody/MIDI (note-level) – pitch content

Coarser controls should be set first; fine-grained controls refine within the space defined by coarser ones.