# Controllable Generation
Controllable generation enables fine-grained steering of AI music output beyond simple text prompts. This page covers the techniques that give users precise control over musical attributes like pitch, dynamics, timing, and style.
## The Control Spectrum
Control methods range from coarse to fine-grained:
```
Coarse ─────────────────────────────────────────▶ Fine

Text prompt → Style transfer → Audio conditioning → MIDI input → Per-frame controls
```
| Control Level | Example | Precision |
|---|---|---|
| Text prompt | "upbeat jazz" | Low |
| Reference audio | "like this song" | Medium |
| Melody conditioning | Hum + generate | Medium-high |
| Chord conditioning | I-V-vi-IV | High |
| MIDI control | Note-by-note | Very high |
| Frame-level control | Energy, pitch per frame | Maximum |
## Text-Based Control
### Prompt Engineering
Prompt engineering is the baseline control method. Output quality depends on:
- Vocabulary alignment with training data
- Specificity of descriptors
- Consistency of prompt elements
See the Prompt Engineering Guide and Genre-Specific Prompting for detailed strategies.
### Classifier-Free Guidance (CFG)
Control the strength of text conditioning during inference:
| Guidance Scale | Effect |
|---|---|
| 0 | Unconditioned (ignores prompt) |
| 1–3 | Balanced (natural, diverse) |
| 5–7 | Strong adherence (less diverse) |
| 10+ | Very literal (may reduce quality) |
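The guidance scales above can be read off directly from the CFG update rule. A minimal sketch, assuming the model has already produced an unconditional and a text-conditioned prediction at a given denoising step (`cfg_combine` is a hypothetical helper name):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the model output toward the
    text-conditioned direction by guidance_scale.

    scale = 0 -> unconditioned (ignores prompt)
    scale = 1 -> plain conditional prediction
    larger    -> more literal prompt adherence, less diversity
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the model runs twice per step (with and without the prompt embedding) and the two outputs are combined this way; swapping the unconditional pass for a negative-prompt embedding gives the steering-away behavior described below.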
### Negative Prompts
Specify what to avoid:
Example: "jazz piano trio" (positive) + "electronic, synthesizer, drums" (negative) steers away from unwanted elements.
## Audio Conditioning
### Style Transfer
Use a reference audio clip to guide the generation:
The style embedding captures timbre, production quality, and overall aesthetic without copying specific notes.
### Melody Conditioning
Provide a melody (humming, whistling, or MIDI) that the model follows:
The chromagram captures pitch class over time, allowing the model to match the melody while generating its own arrangement.
MusicGen-Melody uses this approach with a reference audio input.
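A chromagram can be computed by mapping FFT-bin frequencies to pitch classes. The sketch below is a simplified version built directly on NumPy (libraries such as librosa provide tuned implementations); all function and parameter names here are illustrative:

```python
import numpy as np

def chromagram(signal, sr, frame_len=2048, hop=512, fmin=55.0, fmax=4000.0):
    """Simple 12-bin chromagram: relative energy per pitch class per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    in_range = (freqs >= fmin) & (freqs <= fmax)
    # Map each FFT bin to a pitch class via MIDI note numbers (A4 = 440 Hz = 69)
    midi = 69 + 12 * np.log2(freqs[in_range] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.zeros((12, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))[in_range]
        for pc in range(12):
            chroma[pc, t] = mag[pitch_class == pc].sum()
    # Normalize each frame so the model sees relative pitch-class weights
    return chroma / np.maximum(chroma.sum(axis=0, keepdims=True), 1e-9)
```

Because only pitch class (not octave or timbre) survives, the model is free to re-harmonize and re-orchestrate around the given melody.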
### Audio Inpainting
Regenerate only specific time segments while keeping the rest:

`x_out = m ⊙ x_original + (1 − m) ⊙ x_generated`

where `m` is a binary mask indicating which regions to keep.
In diffusion models, inpainting is natural:
- Add noise only to the masked region
- Denoise while conditioning on unmasked context
- Blend at boundaries
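The steps above can be sketched as a RePaint-style loop. This is a toy version under stated assumptions: `denoise_step(x, t)` stands in for the model's reverse-diffusion step, and the linear noise schedule is illustrative (a real model's schedule differs):

```python
import numpy as np

def diffusion_inpaint(x_known, keep_mask, denoise_step, n_steps=50, seed=0):
    """Inpainting sketch: keep_mask is 1.0 where the original audio (or
    latent) should be preserved and 0.0 where it should be regenerated."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_known.shape)           # start from pure noise
    for t in reversed(range(1, n_steps + 1)):
        sigma = t / n_steps                          # toy linear schedule
        # Re-noise the kept region to the current noise level so both
        # regions are statistically consistent at this step
        known_t = (np.sqrt(1 - sigma**2) * x_known
                   + sigma * rng.standard_normal(x_known.shape))
        x = keep_mask * known_t + (1 - keep_mask) * x
        x = denoise_step(x, t)                       # denoise with full context
    # Final blend: copy the kept region back exactly
    return keep_mask * x_known + (1 - keep_mask) * x
```

Because the denoiser always sees the full (partially known) signal, the regenerated region stays musically coherent with its surroundings.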
### Audio-to-Audio Translation
Transform existing audio while preserving structure:
- Input: piano recording → Output: orchestral arrangement
- Input: rough demo → Output: polished production
- Input: electronic track → Output: acoustic version
## Structural Control
### Section Tags
Explicit structural markers in prompts:
```
[Intro] ambient pad, gentle piano
[Verse] add drums, bass enters, vocal melody
[Chorus] full energy, all instruments, anthemic
[Bridge] stripped back, just piano and vocal
[Outro] gradual fade, reverb tails
```
### Temporal Conditioning
Provide per-section control signals:
This allows smooth transitions between different conditioning states.
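One way to build such a signal is to expand section annotations into a per-frame embedding array with linear crossfades at the boundaries. A sketch, assuming precomputed section embeddings (the function name and frame rate are illustrative):

```python
import numpy as np

def section_conditioning(sections, frames_per_sec=10, fade_sec=1.0):
    """Build a frame-level conditioning signal from section annotations.

    sections: contiguous list of (start_sec, end_sec, embedding) tuples.
    Returns an array of shape (n_frames, embed_dim) with linear
    crossfades at section boundaries for smooth transitions.
    """
    n_frames = int(sections[-1][1] * frames_per_sec)
    out = np.zeros((n_frames, len(sections[0][2])))
    for start, end, emb in sections:
        out[int(start * frames_per_sec):int(end * frames_per_sec)] = emb
    half = max(int(fade_sec * frames_per_sec) // 2, 1)
    for i in range(1, len(sections)):            # crossfade each boundary
        b = int(sections[i][0] * frames_per_sec)
        lo, hi = max(b - half, 0), min(b + half, n_frames)
        alpha = np.linspace(0.0, 1.0, hi - lo)[:, None]
        prev, cur = np.asarray(sections[i - 1][2]), np.asarray(sections[i][2])
        out[lo:hi] = (1 - alpha) * prev + alpha * cur
    return out
```

The crossfade width trades abruptness against smearing: wider fades sound smoother but blur the section boundary.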
### Energy Curves
Specify an energy trajectory:
```
Energy
  |    /‾‾\      /‾‾‾\
  |   /    \    /     \
  |  /      \  /       \
  | /        \/         \___
  +---------------------------- Time
   intro build drop break drop outro
```
Map to a numerical signal that conditions the model frame-by-frame.
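A sketch of that mapping: interpolate a handful of (time, energy) keyframes into a dense per-frame signal. The function name and frame rate are illustrative:

```python
import numpy as np

def energy_signal(keyframes, frames_per_sec=10):
    """keyframes: list of (time_sec, energy in [0, 1]) pairs, time-sorted.
    Returns a per-frame energy array via linear interpolation."""
    times, levels = zip(*keyframes)
    t = np.arange(0.0, times[-1], 1.0 / frames_per_sec)
    return np.interp(t, times, levels)
```

A curve-editor UI only needs to emit the keyframes; the model consumes the interpolated frame-level signal.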
## Musical Attribute Control
### Pitch / Key Control
Force generation into a specific key, typically via a key embedding with 24 classes (12 pitch classes × 2 for major/minor).
### Tempo Control
Condition on exact BPM:
Some models support smooth tempo changes within a generation.
### Instrument Control
Specify which instruments should be present:
Or use multi-hot encoding over an instrument vocabulary.
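A multi-hot encoding is a fixed-length vector with a 1 for each requested instrument. A minimal sketch; the vocabulary below is hypothetical (a real model defines its own):

```python
import numpy as np

# Hypothetical instrument vocabulary for illustration
VOCAB = ["piano", "guitar", "bass", "drums", "strings", "synth", "vocals"]

def multi_hot(instruments):
    """Encode a set of instrument names as a multi-hot vector over VOCAB."""
    vec = np.zeros(len(VOCAB))
    for name in instruments:
        vec[VOCAB.index(name)] = 1.0
    return vec
```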
### Dynamics Control
Per-frame loudness conditioning:
Allows specifying crescendos, drops, and dynamic arcs precisely.
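Such a loudness signal can be extracted from reference audio (or drawn by hand) as per-frame RMS in decibels. A sketch with illustrative names and frame sizes:

```python
import numpy as np

def loudness_curve(signal, frame_len=1024, hop=512):
    """Per-frame RMS loudness in dB (full-scale sine ~ -3 dB),
    usable as a frame-level dynamics conditioning signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    db = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        db[i] = 20 * np.log10(max(rms, 1e-9))  # floor avoids log(0) on silence
    return db
```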
## MIDI-Level Control
### MIDI-Conditioned Generation
Provide a MIDI file as input, generate audio output:
This gives note-level control (pitch, timing, velocity) while the model handles:
- Timbre (instrument sounds)
- Production (effects, spatial positioning)
- Expression (micro-timing, dynamics beyond MIDI velocity)
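A common way to feed MIDI to such a model is a piano-roll matrix: pitch on one axis, time frames on the other, with cell values carrying velocity. A sketch assuming parsed note events (function name and frame rate are illustrative):

```python
import numpy as np

def piano_roll(notes, frames_per_sec=50, n_pitches=128):
    """Render (pitch, start_sec, end_sec, velocity) note events into a
    piano-roll matrix of shape (n_pitches, n_frames) for conditioning."""
    n_frames = int(max(end for _, _, end, _ in notes) * frames_per_sec)
    roll = np.zeros((n_pitches, n_frames))
    for pitch, start, end, velocity in notes:
        s, e = int(start * frames_per_sec), int(end * frames_per_sec)
        roll[pitch, s:e] = velocity / 127.0   # normalize MIDI velocity to [0, 1]
    return roll
```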
#### Advantages
- Exact pitch and rhythm control
- Full arrangement control
- Combine with style conditioning for versatile output
#### Challenges
- Requires MIDI input (not always available)
- Expressiveness limited by MIDI representation
- Model must generalize across MIDI → audio mapping
## Latent Space Manipulation
### Direct Latent Editing
Modify specific dimensions of the latent representation:

`z' = z + α · d`

where `d` is a direction in latent space corresponding to a musical attribute and `α` controls the edit strength.
Finding directions:
- Linear probes: train a linear classifier on labeled data
- Contrastive pairs: compute the difference between "with" and "without" attribute
- PCA of attribute subspace: find the principal direction of variation
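The contrastive-pairs approach can be sketched in a few lines: average the latents of examples with and without the attribute, take the difference, and move along it (helper names are illustrative):

```python
import numpy as np

def contrastive_direction(z_with, z_without):
    """Unit direction from the mean latent without the attribute
    to the mean latent with it (arrays of shape (n_examples, dim))."""
    d = z_with.mean(axis=0) - z_without.mean(axis=0)
    return d / np.linalg.norm(d)

def edit_latent(z, direction, alpha):
    """Move a latent along an attribute direction by strength alpha."""
    return z + alpha * direction
```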
### Interpolation for Smooth Transitions
Blend between two musical states by interpolating their latents, `z(t) = (1 − t) · z₁ + t · z₂` for `t ∈ [0, 1]`.
Useful for:
- Crossfading between styles
- Gradual tempo/energy changes
- Morphing between instruments
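For high-dimensional latents, spherical interpolation (slerp) often blends more gracefully than the linear formula, since it preserves vector magnitude along the path. A sketch:

```python
import numpy as np

def slerp(z1, z2, t):
    """Spherical interpolation between two latents for t in [0, 1]."""
    u1, u2 = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(u1, u2), -1.0, 1.0))
    if omega < 1e-6:                  # nearly parallel: fall back to lerp
        return (1 - t) * z1 + t * z2
    return (np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)
```

Sampling `t` over a sequence of generations produces a gradual morph between the two states.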
## Control in Practice
### Building a Control Interface
Effective control UIs for music generation:
| Control | UI Element | Mapping |
|---|---|---|
| Genre | Dropdown / tags | Text embedding |
| BPM | Slider (60β200) | Tempo conditioning |
| Energy | Curve editor | Frame-level energy |
| Key | Dropdown (C Major, etc.) | Key embedding |
| Instruments | Checkboxes | Multi-hot / text |
| Duration | Slider (5β300s) | Duration conditioning |
| Guidance | Slider (1β15) | CFG scale |
| Structure | Section editor | Temporal segmented prompts |
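On the backend, the UI controls in the table above typically collect into a single validated request object. A hypothetical sketch (class, field names, and ranges mirror the table's sliders, not any particular API):

```python
from dataclasses import dataclass, field

@dataclass
class GenerationControls:
    """Illustrative container for the UI controls; ranges follow the
    table's sliders (BPM 60-200, guidance 1-15, duration 5-300 s)."""
    prompt: str = ""
    bpm: int = 120
    key: str = "C Major"
    instruments: list = field(default_factory=list)
    duration_sec: float = 30.0
    guidance_scale: float = 3.0

    def validate(self):
        assert 60 <= self.bpm <= 200, "BPM outside slider range"
        assert 1 <= self.guidance_scale <= 15, "guidance outside slider range"
        assert 5 <= self.duration_sec <= 300, "duration outside slider range"
        return self
```

Validating at the boundary keeps out-of-range values from silently degrading generation quality.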
### Control Hierarchy
For best results, layer controls from coarse to fine:
- Genre and style (text prompt) β broadest
- Tempo and key (numerical) β structural constraints
- Instruments (text/selection) β sonic palette
- Structure (section tags) β arrangement
- Energy curve (per-frame) β dynamics
- Melody/MIDI (note-level) β pitch content
Coarser controls should be set first; fine-grained controls refine within the space defined by coarser ones.