Mustango: Toward Controllable Text-to-Music Generation

Demo | Model | Website and Examples | Paper | Dataset


Meet Mustango, an exciting addition to the vibrant landscape of Multimodal Large Language Models designed for controlled music generation. Mustango leverages a Latent Diffusion Model (LDM), Flan-T5, and musical features to do the magic!

🔥 Live demo available on Replicate and HuggingFace.

Quickstart Guide

Generate music from a text prompt:

import IPython
import soundfile as sf
from mustango import Mustango

model = Mustango("declare-lab/mustango")

prompt = "This is a new age piece. There is a flute playing the main melody with a lot of staccato notes. The rhythmic background consists of a medium tempo electronic drum beat with percussive elements all over the spectrum. There is a playful atmosphere to the piece. This piece can be used in the soundtrack of a children's TV show or an advertisement jingle."

music = model.generate(prompt)
sf.write(f"{prompt}.wav", music, samplerate=16000)
IPython.display.Audio(data=music, rate=16000)

Installation

git clone https://github.com/AMAAI-Lab/mustango
cd mustango
pip install -r requirements.txt
cd diffusers
pip install -e .
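
After the editable install, a quick sanity check can confirm that Python picks up the local diffusers fork and sees a GPU. This is a minimal sketch that only assumes a standard PyTorch setup:

import torch
import diffusers

# The path should point into the local mustango/diffusers checkout installed with `pip install -e .`
print("diffusers loaded from:", diffusers.__file__)
print("CUDA available:", torch.cuda.is_available())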

Datasets

The MusicBench dataset contains 52k music fragments, each paired with a rich, music-specific text caption.
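
As a hedged sketch for browsing the captions (assuming the dataset is hosted on the Hugging Face Hub under amaai-lab/MusicBench; see the Dataset link above for the authoritative location), it can be loaded with the datasets library:

from datasets import load_dataset

# Assumed repository ID; adjust if the Dataset link above points elsewhere.
musicbench = load_dataset("amaai-lab/MusicBench", split="train")
print(len(musicbench))   # number of music fragments
print(musicbench[0])     # one fragment with its music-specific caption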

Subjective Evaluation by Expert Listeners

| Model | Dataset | Pre-trained | Overall Match ↑ | Chord Match ↑ | Tempo Match ↑ | Audio Quality ↑ | Musicality ↑ | Rhythmic Presence and Stability ↑ | Harmony and Consonance ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Tango | MusicCaps | ✓ | 4.35 | 2.75 | 3.88 | 3.35 | 2.83 | 3.95 | 3.84 |
| Tango | MusicBench | ✓ | 4.91 | 3.61 | 3.86 | 3.88 | 3.54 | 4.01 | 4.34 |
| Mustango | MusicBench | ✓ | 5.49 | 5.76 | 4.98 | 4.30 | 4.28 | 4.65 | 5.18 |
| Mustango | MusicBench | ✗ | 5.75 | 6.06 | 5.11 | 4.80 | 4.80 | 4.75 | 5.59 |

Training

We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config from the terminal and set up your run configuration by answering the questions asked.

You can now train Mustango on the MusicBench dataset using:

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config_munet.json" \
--model_type Mustango --freeze_text_encoder --uncondition_all --uncondition_single \
--drop_sentences --random_pick_text_column --snr_gamma 5

The --model_type flag lets you train either Mustango or Tango with the same code. Note that you also need to point --unet_model_config to the matching config: diffusion_model_config_munet.json for Mustango, diffusion_model_config.json for Tango.

The --uncondition_all, --uncondition_single, and --drop_sentences arguments control the dropout functions described in Section 5.2 of our paper. The --random_pick_text_column argument randomly picks between two input text prompts; in the case of MusicBench, we pick between the ChatGPT-rephrased captions and the original enhanced MusicCaps prompts, as depicted in Figure 1 of our paper.
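
For intuition only, here is a minimal sketch of those two mechanisms: picking one of two caption columns per example and occasionally dropping the text condition. All function and column names below are hypothetical illustrations, not the actual code in train.py:

import random

def pick_caption(example, p_original=0.5):
    # Hypothetical helper: choose between the ChatGPT-rephrased caption and the
    # original enhanced MusicCaps prompt for each training example.
    # "main_caption" / "alt_caption" are assumed column names, not MusicBench's real ones.
    return example["main_caption"] if random.random() < p_original else example["alt_caption"]

def drop_sentences(caption, p_keep=0.9):
    # Hypothetical helper: randomly drop individual sentences from the caption.
    kept = [s for s in caption.split(". ") if random.random() < p_keep]
    return ". ".join(kept) if kept else caption

def maybe_uncondition(caption, p_drop=0.1):
    # Hypothetical helper: with probability p_drop, replace the caption with an empty
    # prompt so the model also learns unconditional (classifier-free-guidance style) generation.
    return "" if random.random() < p_drop else caption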

We recommend training from scratch on MusicBench for at least 40 epochs.

Model Zoo

We have released the following models:

Mustango Pretrained: https://huggingface.co/declare-lab/mustango-pretrained

Mustango: https://huggingface.co/declare-lab/mustango

Citation

Please consider citing the following article if you found our work useful:

@inproceedings{melechovsky2024mustango,
  title={Mustango: Toward Controllable Text-to-Music Generation},
  author={Melechovsky, Jan and Guo, Zixun and Ghosal, Deepanway and Majumder, Navonil and Herremans, Dorien and Poria, Soujanya},
  booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  pages={8286--8309},
  year={2024}
}