Controllable Generation
Controllable generation enables fine-grained steering of AI music output beyond simple text prompts. This page covers the techniques that give users precise control over musical attributes like pitch, dynamics, timing, and style.
The Control Spectrum
Control methods range from coarse to fine-grained:
Coarse ◀────────────────────────────────────────▶ Fine
Text prompt → Style transfer → Audio conditioning → MIDI input → Per-frame controls
| Control Level | Example | Precision |
|---|---|---|
| Text prompt | "upbeat jazz" | Low |
| Reference audio | "like this song" | Medium |
| Melody conditioning | Hum + generate | Medium-high |
| Chord conditioning | I-V-vi-IV | High |
| MIDI control | Note-by-note | Very high |
| Frame-level control | Energy, pitch per frame | Maximum |
Text-Based Control
Prompt Engineering
Text prompting is the baseline control method. Output quality depends on:
- Vocabulary alignment with training data
- Specificity of descriptors
- Consistency of prompt elements
See the Prompt Engineering Guide and Genre-Specific Prompting for detailed strategies.
Classifier-Free Guidance (CFG)
Control the strength of text conditioning during inference:
| Guidance Scale | Effect |
|---|---|
| 0 | Unconditioned (ignores prompt) |
| 1–3 | Balanced (natural, diverse) |
| 5–7 | Strong adherence (less diverse) |
| 10+ | Very literal (may reduce quality) |
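The guidance scale in the table can be read as the weight in the usual classifier-free guidance combination of a conditional and an unconditional prediction. Below is a minimal NumPy sketch, assuming the convention in which a scale of 0 yields the unconditional prediction and a scale of 1 the plain conditional one. The function name `apply_cfg` and the toy arrays are illustrative and not taken from any particular model's API.

```python
import numpy as np

def apply_cfg(uncond_pred: np.ndarray, cond_pred: np.ndarray, scale: float) -> np.ndarray:
    """Combine unconditional and text-conditioned model predictions.

    scale = 0 -> unconditional prediction (prompt ignored)
    scale = 1 -> plain conditional prediction
    scale > 1 -> extrapolates past the conditional prediction
    """
    return uncond_pred + scale * (cond_pred - uncond_pred)

# Toy stand-ins for the two predictions a model would produce (hypothetical values).
uncond = np.array([0.0, 1.0, 2.0])
cond = np.array([1.0, 1.0, 0.0])

guided = apply_cfg(uncond, cond, scale=3.0)
```

Because values above 1 extrapolate beyond what the model predicted for the prompt, very high scales can push outputs into regions the model handles poorly, which is one plausible reading of the quality drop noted for 10+.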
Negative Prompts
Specify what to avoid:
Example: "jazz piano trio" (positive) + "electronic, synthesizer, drums" (negative) steers away from unwanted elements.
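One common way to implement negative prompts in guided samplers is to use the prediction for the negative prompt in place of the unconditional prediction, so that guidance pushes away from it. The sketch below assumes that approach; whether a given music model works this way is not stated here, and the function name and toy values are hypothetical.

```python
import numpy as np

def guide_with_negative(neg_pred: np.ndarray, pos_pred: np.ndarray, scale: float) -> np.ndarray:
    """Start from the negative-prompt prediction and move toward (and past) the positive one."""
    return neg_pred + scale * (pos_pred - neg_pred)

# Toy predictions for the positive and negative prompts (hypothetical values).
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])

steered = guide_with_negative(neg, pos, scale=2.0)
```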
Audio Conditioning
Style Transfer
Use a reference audio clip to guide the generation.
A style embedding extracted from the clip captures timbre, production quality, and overall aesthetic without copying specific notes.
Melody Conditioning
Provide a melody (humming, whistling, or MIDI) for the model to follow.
A chromagram extracted from the melody captures pitch class over time, allowing the model to match the melody while generating its own arrangement.
MusicGen-Melody uses this approach with a reference audio input.
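To make the chromagram idea concrete, the sketch below builds a one-hot pitch-class matrix from a per-frame fundamental-frequency track. This is a simplification: real systems typically compute chroma from a spectrogram of the reference audio. The helper name and the assumption that an f0 track is already available are illustrative.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def f0_to_chromagram(f0_hz: np.ndarray) -> np.ndarray:
    """Return a (12, n_frames) one-hot chromagram.

    Frames with f0 <= 0 are treated as unvoiced and stay all-zero.
    """
    chroma = np.zeros((12, len(f0_hz)))
    voiced = f0_hz > 0
    # Convert Hz to the nearest MIDI note number (69 is A4 = 440 Hz).
    midi = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(int)
    # Octave information is discarded: only the pitch class (0..11) is kept.
    chroma[midi % 12, np.flatnonzero(voiced)] = 1.0
    return chroma

# Hummed melody as a per-frame f0 track: C4, C4, silence, G4, A4.
f0 = np.array([261.63, 261.63, 0.0, 392.00, 440.00])
chroma = f0_to_chromagram(f0)
```

Discarding the octave is what lets the model follow the melodic contour while choosing its own register and instrumentation.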
Audio Inpainting
Regenerate only specific time segments while keeping the rest. The output combines the original audio with newly generated audio through a binary mask M that indicates which regions to keep (1 = keep, 0 = regenerate).
Diffusion models support inpainting naturally:
- Add noise only to the region being regenerated
- Denoise while conditioning on the kept (unmasked) context
- Blend at the boundaries
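The keep/regenerate masking and the boundary blending can be sketched on raw samples as follows. This illustrates only the final combination step, not the diffusion sampling itself. The linear crossfade, the function names, and the `fade_len` parameter are assumptions made for illustration.

```python
import numpy as np

def soft_keep_mask(n: int, start: int, end: int, fade_len: int) -> np.ndarray:
    """Mask that is 1 where the original is kept and 0 inside [start, end),
    with linear ramps of fade_len samples on both sides of the region."""
    mask = np.ones(n)
    mask[start:end] = 0.0
    ramp = np.linspace(1.0, 0.0, fade_len, endpoint=False)
    lo = max(start - fade_len, 0)
    mask[lo:start] = ramp[fade_len - (start - lo):]   # fade out before the region
    hi = min(end + fade_len, n)
    mask[end:hi] = ramp[::-1][: hi - end]             # fade back in after the region
    return mask

def inpaint_blend(original: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the original where mask is 1, use generated audio where mask is 0."""
    return mask * original + (1.0 - mask) * generated

# Toy signals: the original is all ones, the regenerated audio all zeros.
original = np.ones(16)
generated = np.zeros(16)
mask = soft_keep_mask(16, start=6, end=10, fade_len=2)
result = inpaint_blend(original, generated, mask)
```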
Audio-to-Audio Translation
Transform existing audio while preserving structure:
- Input: piano recording → Output: orchestral arrangement
- Input: rough demo → Output: polished production
- Input: electronic track → Output: acoustic version
Structural Control
Section Tags
Explicit structural markers in prompts:
[Intro] ambient pad, gentle piano
[Verse] add drums, bass enters, vocal melody
[Chorus] full energy, all instruments, anthemic
[Bridge] stripped back, just piano and vocal
[Outro] gradual fade, reverb tails
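A front end that accepts tags like these would typically split the prompt into (section, description) pairs before conditioning the model. Below is a minimal parsing sketch. The tag syntax follows the example above, while the function name and the returned structure are illustrative assumptions.

```python
import re

# A line of the form "[Tag] free-text description".
SECTION_LINE = re.compile(r"^\[(?P<tag>[^\]]+)\]\s*(?P<desc>.*)$")

def parse_section_prompt(text: str) -> list[tuple[str, str]]:
    """Return (section_tag, description) pairs in order; untagged lines are skipped."""
    sections = []
    for line in text.splitlines():
        match = SECTION_LINE.match(line.strip())
        if match:
            sections.append((match.group("tag"), match.group("desc").strip()))
    return sections

prompt = """[Intro] ambient pad, gentle piano
[Chorus] full energy, all instruments, anthemic"""
sections = parse_section_prompt(prompt)
```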
Temporal Conditioning
Provide control signals per section, so that the conditioning changes over the course of the piece.
This allows smooth transitions between different conditioning states.
Energy Curves
Specify an energy trajectory:
Energy
│      ╱‾‾╲      ╱‾‾‾╲
│     ╱    ╲    ╱     ╲
│    ╱      ╲  ╱       ╲
│   ╱        ╲╱         ╲___
└──────────────────────────── Time
 intro build drop break drop outro
The curve is then mapped to a numerical signal that conditions the model frame by frame.
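One simple way to turn a drawn curve into a per-frame signal is linear interpolation between breakpoints. The sketch below assumes energy values in [0, 1] and a conditioning frame rate of 50 Hz. Both are illustrative choices, as are the breakpoint values.

```python
import numpy as np

def energy_curve(breakpoints, duration_s: float, frame_rate: float) -> np.ndarray:
    """Linearly interpolate (time_s, energy) breakpoints into one value per frame."""
    times, values = zip(*breakpoints)
    n_frames = int(round(duration_s * frame_rate))
    frame_times = np.arange(n_frames) / frame_rate
    return np.interp(frame_times, times, values)

# intro -> build -> drop -> break -> drop -> outro (hypothetical breakpoints)
breakpoints = [(0, 0.1), (8, 0.4), (16, 1.0), (24, 0.3), (32, 1.0), (40, 0.1)]
energy = energy_curve(breakpoints, duration_s=40, frame_rate=50)
```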
Musical Attribute Control
Pitch / Key Control
Force generation into a specific key by conditioning on a key embedding chosen from 24 classes (12 pitch classes × 2 modes, major and minor).
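The 12 × 2 layout can be sketched as an index into a 24-way one-hot vector, which an embedding table could then look up. The ordering of classes (tonic-major first, then tonic-minor) and the function names are assumptions for illustration.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MODES = ["major", "minor"]

def key_to_index(tonic: str, mode: str) -> int:
    """Map a key such as ("A", "minor") to an integer in 0..23."""
    return PITCH_CLASSES.index(tonic) * len(MODES) + MODES.index(mode)

def key_one_hot(tonic: str, mode: str) -> np.ndarray:
    """24-dimensional one-hot vector for the given key."""
    vec = np.zeros(len(PITCH_CLASSES) * len(MODES))
    vec[key_to_index(tonic, mode)] = 1.0
    return vec

a_minor = key_one_hot("A", "minor")
```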
Tempo Control
Condition on an exact BPM value.
Some models also support smooth tempo changes within a generation.
Instrument Control
Specify which instruments should be present, either by naming them in the text prompt or with a multi-hot encoding over an instrument vocabulary.
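A multi-hot encoding sets one entry per instrument that should be present. The vocabulary below is a made-up example; a real system would define its own list.

```python
import numpy as np

# Hypothetical instrument vocabulary, for illustration only.
INSTRUMENT_VOCAB = ["piano", "bass", "drums", "guitar", "strings", "synth", "vocals"]

def instruments_to_multi_hot(instruments) -> np.ndarray:
    """Return a vector with 1.0 at the position of every requested instrument."""
    vec = np.zeros(len(INSTRUMENT_VOCAB))
    for name in instruments:
        # list.index raises ValueError for names outside the vocabulary.
        vec[INSTRUMENT_VOCAB.index(name)] = 1.0
    return vec

jazz_trio = instruments_to_multi_hot(["piano", "bass", "drums"])
```

Unlike a one-hot vector, several entries can be active at once, so any combination of instruments from the vocabulary can be requested.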
Dynamics Control
Per-frame loudness conditioning allows crescendos, drops, and dynamic arcs to be specified precisely.
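A per-frame loudness signal of this kind can be measured from audio as frame-wise RMS level in decibels, for example to derive a target curve from a reference recording. The sketch below assumes that definition of loudness (perceptual loudness measures are more involved), and the frame and hop sizes are illustrative.

```python
import numpy as np

def frame_loudness_db(audio: np.ndarray, frame_len: int, hop_len: int, eps: float = 1e-10) -> np.ndarray:
    """RMS level in dB for each frame of a mono signal."""
    n_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([audio[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)])
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)  # eps avoids log(0) on silent frames

# One second of a 220 Hz tone with a linear crescendo (synthetic test signal).
sr = 16000
t = np.arange(sr) / sr
audio = np.linspace(0.01, 1.0, sr) * np.sin(2 * np.pi * 220.0 * t)

loudness = frame_loudness_db(audio, frame_len=1024, hop_len=512)
```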
MIDI-Level Control
MIDI-Conditioned Generation
Provide a MIDI file as input and generate audio as output.
This gives note-level control (pitch, timing, velocity) while the model handles:
- Timbre (instrument sounds)
- Production (effects, spatial positioning)
- Expression (micro-timing, dynamics beyond MIDI velocity)
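Note-level input is often presented to a model as a piano roll: a pitch-by-time matrix holding note velocities. The sketch below assumes the MIDI file has already been parsed into (pitch, start, end, velocity) tuples, and the 100 Hz frame rate is an illustrative choice. How a given model actually consumes MIDI may differ.

```python
import numpy as np

def notes_to_piano_roll(notes, duration_s: float, frame_rate: float) -> np.ndarray:
    """Build a (128, n_frames) piano roll with velocities scaled to [0, 1].

    notes: iterable of (midi_pitch, start_s, end_s, velocity) tuples.
    """
    n_frames = int(round(duration_s * frame_rate))
    roll = np.zeros((128, n_frames))
    for pitch, start_s, end_s, velocity in notes:
        first = int(round(start_s * frame_rate))
        last = max(int(round(end_s * frame_rate)), first + 1)  # at least one frame per note
        roll[pitch, first:last] = velocity / 127.0
    return roll

# C4, E4, G4 played in sequence (hypothetical note list).
notes = [(60, 0.0, 0.5, 100), (64, 0.5, 1.0, 80), (67, 1.0, 2.0, 110)]
roll = notes_to_piano_roll(notes, duration_s=2.0, frame_rate=100)
```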
Advantages
- Exact pitch and rhythm control
- Full arrangement control
- Combine with style conditioning for versatile output
Challenges
- Requires MIDI input (not always available)
- Expressiveness limited by MIDI representation
- Model must generalize across MIDI → audio mapping
Latent Space Manipulation
Direct Latent Editing
Modify the latent representation directly by shifting it along a direction in latent space that corresponds to a musical attribute.
Finding directions:
- Linear probes: train a linear classifier on labeled data
- Contrastive pairs: compute the difference between "with" and "without" attribute
- PCA of attribute subspace: find the principal direction of variation
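The contrastive-pairs approach from the list above can be sketched as a difference of mean latents, followed by a shift along the resulting direction. The synthetic latents, the unit normalization, and the function names are assumptions for illustration; real latents would come from a model's encoder.

```python
import numpy as np

def contrastive_direction(with_attr: np.ndarray, without_attr: np.ndarray) -> np.ndarray:
    """Unit-length direction from the 'without' mean latent to the 'with' mean latent."""
    direction = with_attr.mean(axis=0) - without_attr.mean(axis=0)
    return direction / np.linalg.norm(direction)

def edit_latent(z: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Shift a latent along an attribute direction."""
    return z + strength * direction

# Synthetic data: the attribute is planted along dimension 3 of an 8-d latent space.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 8))
true_dir = np.zeros(8)
true_dir[3] = 1.0
without_attr = base
with_attr = base + 2.0 * true_dir

d = contrastive_direction(with_attr, without_attr)
z_edit = edit_latent(base[0], d, strength=1.5)
```

With real data the two groups are not exact copies of each other, so the recovered direction is only an estimate and may also move correlated attributes.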
Interpolation for Smooth Transitions
Blend between two musical states by interpolating between their latent representations.
This is useful for:
- Crossfading between styles
- Gradual tempo/energy changes
- Morphing between instruments
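A minimal sketch of such a blend is plain linear interpolation between two latent vectors. Some systems use spherical interpolation instead; linear is shown here for simplicity, and the toy vectors are illustrative.

```python
import numpy as np

def lerp(z_a: np.ndarray, z_b: np.ndarray, t: float) -> np.ndarray:
    """Linear blend: t = 0 gives z_a, t = 1 gives z_b."""
    return (1.0 - t) * z_a + t * z_b

def interpolation_path(z_a: np.ndarray, z_b: np.ndarray, n_steps: int) -> np.ndarray:
    """Stack n_steps evenly spaced blends from z_a to z_b, endpoints included."""
    return np.stack([lerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, n_steps)])

z_a = np.array([0.0, 0.0])   # toy latent for state A
z_b = np.array([1.0, 2.0])   # toy latent for state B
path = interpolation_path(z_a, z_b, n_steps=5)
```

Decoding each row of `path` in turn would give a gradual transition from state A to state B.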
Control in Practice
Building a Control Interface
An effective control UI for music generation maps each control to a UI element and a conditioning signal:
| Control | UI Element | Mapping |
|---|---|---|
| Genre | Dropdown / tags | Text embedding |
| BPM | Slider (60–200) | Tempo conditioning |
| Energy | Curve editor | Frame-level energy |
| Key | Dropdown (C Major, etc.) | Key embedding |
| Instruments | Checkboxes | Multi-hot / text |
| Duration | Slider (5–300s) | Duration conditioning |
| Guidance | Slider (1–15) | CFG scale |
| Structure | Section editor | Temporally segmented prompts |
Control Hierarchy
For best results, layer controls from coarse to fine:
- Genre and style (text prompt) — broadest
- Tempo and key (numerical) — structural constraints
- Instruments (text/selection) — sonic palette
- Structure (section tags) — arrangement
- Energy curve (per-frame) — dynamics
- Melody/MIDI (note-level) — pitch content
Coarser controls should be set first; fine-grained controls refine within the space defined by coarser ones.