# Fine-Tuning and Adaptation for Audio Models
Pre-trained audio models can be adapted to specific styles, instruments, artists, or use cases through fine-tuning. This page covers the practical techniques for customizing audio AI models.
## Why Fine-Tune?
Pre-trained models are general-purpose. Fine-tuning specializes them:
| Goal | Example |
|---|---|
| Genre specialization | Fine-tune for jazz or classical |
| Instrument focus | Specialize in guitar or piano |
| Style matching | Match a specific production aesthetic |
| Language/accent | Adapt singing voice to a specific language |
| Quality improvement | Fine-tune on curated high-quality data |
| Task adaptation | Adapt generation model for editing or inpainting |
## Full Fine-Tuning
Update all model parameters on new data:
### When to Use
- Sufficient fine-tuning data available (hundreds to thousands of examples)
- Significant domain shift from pre-training data
- Compute budget allows full training
### Risks
- Catastrophic forgetting: model loses general capabilities
- Overfitting: memorizes small fine-tuning dataset
- Compute cost: same as pre-training per step
### Mitigation

- Lower learning rate: 10–100× smaller than pre-training (e.g., 1e-5 to 1e-4)
- Short training: fewer epochs (5β50)
- Regularization: weight decay, dropout, early stopping
- Data augmentation: expand small datasets (see Data Augmentation)
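The mitigations above can be sketched as a minimal early-stopping loop. This is a framework-agnostic sketch: `train_step` and `val_loss` are hypothetical stand-ins for your training framework's epoch and evaluation functions.

```python
def fine_tune(model, train_step, val_loss, base_lr=1e-3,
              lr_shrink=100, max_epochs=50, patience=3):
    """Sketch: full fine-tuning with a reduced LR and early stopping."""
    lr = base_lr / lr_shrink           # 10-100x smaller than pre-training
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model, lr)          # one epoch of updates (hypothetical)
        loss = val_loss(model)
        if loss < best - 1e-4:         # improvement: reset patience
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience: # stop before overfitting sets in
                break
    return model, best
```

The patience counter implements "short training": the loop exits as soon as validation loss plateaus, rather than running a fixed epoch budget.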
## Parameter-Efficient Fine-Tuning (PEFT)
Update only a small fraction of parameters while freezing the rest.
### LoRA (Low-Rank Adaptation)

The most popular PEFT technique. Decomposes the weight update as a product of low-rank matrices:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

During training: freeze $W$, train only $A$ and $B$.

Parameter savings: for $d = k = 1024$ and $r = 16$, LoRA adds $r(d + k) \approx 32\text{K}$ parameters vs. ~1M for the full matrix, 32× fewer.
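A minimal numpy sketch of the decomposition (illustrative only; real implementations typically also scale $BA$ by a factor $\alpha/r$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1024, 1024, 16

W = rng.standard_normal((d, k))         # frozen pre-trained weight
B = np.zeros((d, r))                    # trainable; zero init so the
A = rng.standard_normal((r, k)) * 0.01  # adapter starts as a no-op

def lora_forward(x):
    # y = (W + BA) x, computed without materializing the d-by-k update
    return W @ x + B @ (A @ x)

# Parameter counts: full update vs. low-rank factors
full_params = d * k          # 1,048,576
lora_params = r * (d + k)    # 32,768 -> 32x fewer
```

Because $B$ is zero-initialized, the adapted model is exactly the base model at step 0; training then moves it away gradually.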
### LoRA for Audio
Apply LoRA to attention layers:
| LoRA Rank ($r$) | Quality | Params Added | Use Case |
|---|---|---|---|
| 4 | Modest adaptation | Very few | Subtle style shifts |
| 16 | Good adaptation | Few | Genre/instrument focus |
| 64 | Strong adaptation | Moderate | Significant domain change |
| 128+ | Near full fine-tune | Many | Maximum flexibility |
### QLoRA
Combine LoRA with quantization for memory efficiency:
- Quantize pre-trained model to 4-bit (NF4 quantization)
- Add LoRA adapters in full precision
- Train only the adapters
This allows fine-tuning large models on consumer GPUs:
- 3.3B parameter model: ~8 GB VRAM with QLoRA
- vs. ~26 GB for full fine-tuning in FP16
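The FP16 figure above can be reproduced with a rough estimate. This is a sketch assuming fp16 weights, fp16 gradients, and fp16 Adam moments, and it ignores activations; QLoRA's ~8 GB additionally depends on adapter precision, activations, and dequantization buffers, which this does not model.

```python
def vram_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=4):
    """Rough training-memory estimate: weights + gradients + optimizer
    state per parameter, ignoring activations. Assumes every
    parameter is trainable."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

n = 3.3e9
full_fp16 = vram_gb(n)       # ~26.4 GB, matching the figure above
base_4bit = n * 0.5 / 1e9    # 4-bit NF4 weights alone: ~1.65 GB
```

With the 4-bit base model at ~1.65 GB and gradients/optimizer state kept only for the tiny adapter fraction, the bulk of QLoRA's remaining budget goes to activations and temporary buffers.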
### Adapters

Insert small trainable modules between frozen layers:

$$h' = h + f(h)$$

where $f$ is a small feedforward network (down-project → nonlinearity → up-project):

$$f(h) = W_{\text{up}}\,\sigma(W_{\text{down}}\,h)$$

with $W_{\text{down}} \in \mathbb{R}^{m \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times m}$, and bottleneck width $m \ll d$.
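A numpy sketch of the residual bottleneck module (dimensions and ReLU choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 512, 32                        # hidden size, bottleneck (m << d)

W_down = rng.standard_normal((m, d)) * 0.01  # trainable
W_up = np.zeros((d, m))               # zero init: adapter starts as identity

def adapter(h):
    # h' = h + W_up @ relu(W_down @ h): residual bottleneck
    return h + W_up @ np.maximum(W_down @ h, 0.0)

adapter_params = 2 * m * d            # 32,768 vs. d*d = 262,144 for a full layer
```

As with LoRA, zero-initializing the up-projection means the frozen network's behavior is unchanged before training begins.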
### Prefix Tuning

Prepend learnable "prefix" tokens to the input sequence:

$$[p_1, \ldots, p_m;\; x_1, \ldots, x_n]$$

The prefix tokens act as a task-specific prompt that's optimized through gradient descent.
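A numpy sketch of the mechanism: the prefix matrix is the only trainable tensor, and it is simply concatenated ahead of the frozen input embeddings (dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 64, 8, 20                    # embed dim, prefix length, seq length

prefix = rng.standard_normal((m, d)) * 0.01  # the ONLY trainable tensor
x = rng.standard_normal((n, d))              # frozen input embeddings

def with_prefix(x):
    # [p_1..p_m; x_1..x_n]: prepend learned rows to the sequence
    return np.concatenate([prefix, x], axis=0)
```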
### Textual Inversion
Learn new "concepts" represented as text tokens:
For audio: learn a new text embedding for a specific instrument, vocalist, or production style. The new token can then be used in prompts alongside regular text.
Example: train a token [my_guitar] on recordings of a specific guitar, then prompt: "Jazz melody with [my_guitar], gentle drums, upright bass."
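Mechanically, textual inversion just adds one new trainable row to the (otherwise frozen) text-embedding table. A numpy sketch, with hypothetical vocabulary size and token id:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d = 1000, 64
emb = rng.standard_normal((vocab, d))      # frozen text-embedding table

# Append one row for [my_guitar]; only this row receives gradients.
new_row = emb.mean(axis=0, keepdims=True)  # init near the embedding mean
emb = np.concatenate([emb, new_row], axis=0)
MY_GUITAR_ID = vocab                       # id assigned to the new token

def embed(token_ids):
    # Lookup works identically for old and new tokens
    return emb[token_ids]
```

At inference, the prompt tokenizer maps [my_guitar] to `MY_GUITAR_ID`, and the learned embedding flows through the frozen model like any other token.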
### DreamBooth for Audio

Fine-tune the entire model while associating a rare token with a specific concept:

- Pick a rare token (e.g., [V])
- Fine-tune on examples of the target concept labeled with [V]
- Use [V] in prompts to invoke the learned concept

With prior preservation, also train on general data to prevent forgetting:

$$\mathcal{L} = \mathcal{L}_{\text{target}} + \lambda\,\mathcal{L}_{\text{prior}}$$
## Fine-Tuning Recipes
### Recipe 1: Genre Specialization with LoRA
1. Collect 100β1000 high-quality tracks in target genre
2. Segment into 10β30 second clips
3. Annotate with descriptive captions
4. Fine-tune with LoRA (r=32) on attention layers
5. Use AdamW, lr=1e-4, 500β2000 steps
6. Evaluate with genre-specific FAD and human listening
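Recipe 1 as a configuration sketch. The keys and module names are illustrative, not a specific trainer's schema; adapt them to whatever fine-tuning framework you use.

```python
# Hypothetical config for Recipe 1 (genre specialization with LoRA)
genre_lora_config = {
    "method": "lora",
    "rank": 32,
    "target_modules": ["attn.q", "attn.k", "attn.v", "attn.out"],
    "optimizer": "adamw",
    "learning_rate": 1e-4,
    "max_steps": 2000,                 # within the 500-2000 range above
    "clip_seconds": (10, 30),          # segment length for training clips
    "eval": ["genre_fad", "human_listening"],
}
```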
### Recipe 2: Voice Adaptation
1. Collect 5β30 minutes of target voice recordings
2. Ensure clean, unaccompanied recordings
3. Extract speaker embedding as conditioning
4. Fine-tune decoder/vocoder with LoRA
5. Use lr=1e-5, 200β1000 steps
6. Evaluate with MOS and speaker similarity
### Recipe 3: Production Style Transfer
1. Collect 50β200 tracks with target production style
2. Include diverse musical content within the style
3. Caption with style-specific descriptors
4. Fine-tune with LoRA (r=16) for 300β1000 steps
5. Use textual inversion for style token
6. Evaluate A/B against base model
## Training Data Requirements
| Technique | Minimum Data | Best Data | Quality Sensitivity |
|---|---|---|---|
| Full fine-tuning | 500+ tracks | 5000+ | Moderate |
| LoRA | 50β200 tracks | 500+ | High |
| Textual inversion | 10β50 examples | 50β100 | Very high |
| DreamBooth | 10β30 examples | 30β100 | Very high |
Data quality matters more than quantity for fine-tuning. A small set of excellent examples outperforms a large set of mediocre ones.
## Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Learning rate too high | Quality collapses quickly | Reduce LR 10× |
| Training too long | Outputs all sound the same | Stop earlier, use validation |
| Data too homogeneous | Model overfits to specific patterns | Add diversity |
| Catastrophic forgetting | Loses general capability | Use LoRA, lower LR, or replay |
| Captions too generic | Adapter doesn't specialize | Write detailed, specific captions |
## Serving Fine-Tuned Models
### LoRA Weight Merging

For deployment, merge the LoRA weights into the base model:

$$W_{\text{merged}} = W + BA$$

This produces a standard model with no inference overhead.
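A numpy sketch showing that the one-time merge is exactly equivalent to the two-path adapter computation:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, r = 64, 64, 8
W = rng.standard_normal((d, k))        # base weight
B = rng.standard_normal((d, r)) * 0.1  # trained LoRA factors
A = rng.standard_normal((r, k)) * 0.1

W_merged = W + B @ A                   # one-time merge at export

x = rng.standard_normal(k)
merged_out = W_merged @ x              # single matmul at inference
adapter_out = W @ x + B @ (A @ x)      # base + adapter path during training
```

After merging, the deployed checkpoint has the same shape and latency as the base model; the trade-off is that the adapter can no longer be swapped out.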
### Multiple LoRA Switching

Serve the base model with swappable LoRA adapters:

- Base Model + LoRA_jazz → Jazz output
- Base Model + LoRA_classical → Classical output
- Base Model + LoRA_electronic → Electronic output

Only the small adapter weights change, not the full model.
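A sketch of the serving pattern: one resident base weight, plus a registry of small per-style $(B, A)$ pairs selected at request time (styles and shapes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
d = 32
W = rng.standard_normal((d, d))        # shared frozen base weight

# One small (B, A) pair per style; only these differ between styles
adapters = {
    "jazz":      (rng.standard_normal((d, 4)), rng.standard_normal((4, d))),
    "classical": (rng.standard_normal((d, 4)), rng.standard_normal((4, d))),
}

def forward(x, style):
    # Base model stays resident in memory; only the adapter is swapped
    B, A = adapters[style]
    return W @ x + B @ (A @ x)
```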
### LoRA Interpolation

Blend between styles:

$$\Delta W = \alpha\,\Delta W_1 + (1 - \alpha)\,\Delta W_2$$

where $\alpha \in [0, 1]$ interpolates between two fine-tuned styles.
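A numpy sketch of the blend (shapes illustrative; $\Delta W_1$ and $\Delta W_2$ stand for the $BA$ products of two trained adapters):

```python
import numpy as np

rng = np.random.default_rng(7)
d, r = 32, 4
dW1 = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # style 1
dW2 = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # style 2

def blend(alpha):
    # alpha=1 -> pure style 1, alpha=0 -> pure style 2
    return alpha * dW1 + (1 - alpha) * dW2
```

Intermediate values of alpha give a mix of the two learned styles, though nothing guarantees the blend is perceptually linear; listening tests are still needed.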